Abstract

Multicore shared cache processors pose a challenge for designers of embedded systems who 
try to achieve minimal and predictable execution time of workloads consisting of several jobs. 
To address this challenge the cache is statically partitioned among the cores and the jobs are 
assigned to the cores so as to minimize the makespan. Several heuristic algorithms have been 
proposed that jointly decide how to partition the cache among the cores and assign the jobs. 
We initiate a theoretical study of this problem which we call the joint cache partition and job 
assignment problem. 

By a careful analysis of the possible cache partitions we obtain a constant approximation 
algorithm for this problem. For some practical special cases we obtain a 2-approximation algo- 
rithm, and show how to improve the approximation factor even further by allowing the algorithm 
to use additional cache. We also study possible improvements that can be obtained by allowing 
dynamic cache partitions and dynamic job assignments. 

We define a natural special case of the well known scheduling problem on unrelated machines 
in which machines are ordered by "strength". Our joint cache partition and job assignment
problem generalizes this scheduling problem which we think is of independent interest. We give 
a polynomial time algorithm for this scheduling problem for instances obtained by fixing the 
cache partition in a practical case of the joint cache partition and job assignment problem where 
job loads are step functions. 

1 Introduction 

We study the problem of assigning $n$ jobs to $c$ cores on a multi-core processor, and simultaneously
partitioning a shared cache of size $K$ among the cores. Each job $j$ is given by a non-increasing
function $T_j(x)$ indicating the running time of job $j$ on a core with cache of size $x$. A solution is a
cache partition $p$, assigning $p(i)$ cache to each core $i$, and a job assignment $S$ assigning each job $j$
to core $S(j)$. The total cache allocated to the cores in the solution is $K$, that is $\sum_{i=1}^{c} p(i) = K$. The
makespan of a cache partition $p$ and a job assignment $S$ is $\max_i \sum_{j \mid S(j)=i} T_j(p(i))$. Our goal is to
find a cache partition and a job assignment that minimize the makespan.
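
To make the objective concrete, the following is a minimal Python sketch of the makespan computation defined above. The function name `makespan`, the 0-based indexing, and the toy load functions are our own illustrative assumptions and are not part of the formal model.

```python
from typing import Callable, Sequence

LoadFn = Callable[[float], float]  # T_j(x): running time of job j with cache x


def makespan(load_fns: Sequence[LoadFn],
             partition: Sequence[float],
             assignment: Sequence[int]) -> float:
    """Return max over cores i of the sum of T_j(p(i)) over jobs j with S(j) = i.

    partition[i] plays the role of p(i) and assignment[j] of S(j),
    both 0-based (an assumption of this sketch).
    """
    loads = [0.0] * len(partition)
    for j, core in enumerate(assignment):
        loads[core] += load_fns[j](partition[core])
    return max(loads)


# Toy instance with illustrative numbers: c = 2 cores, K = 4 units of cache.
load_fns = [
    lambda x: 10 / (1 + x),           # smoothly decreasing load
    lambda x: 4.0,                    # load independent of the cache
    lambda x: 6.0 if x < 2 else 3.0,  # step function
]
partition = [3, 1]       # p(0) + p(1) = K = 4
assignment = [0, 1, 0]   # jobs 0 and 2 on core 0, job 1 on core 1
result = makespan(load_fns, partition, assignment)
```

In this toy instance core 0 carries jobs 0 and 2 and core 1 carries job 1, so the makespan is the load of core 0.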

Multi-core processors are the prevalent computational architecture used today in PC's, mobile
devices and high performance computing. Having multiple cores running jobs concurrently, while 
sharing the same level 2 and/or level 3 cache, results in complex interactions between the jobs, 
thereby posing a significant challenge in determining the makespan of a set of jobs. Cache parti- 
tioning has emerged as a technique to increase run time predictability and increase performance on 
multi-core processors [U [H] . Theoretic research on online multi-core caching shows that the cache 
partition (which may be dynamic) has more influence on the performance than the eviction policy 
[5, 12]. To obtain effective cache partitions, methods have been developed to estimate the running
time of jobs as a function of allocated cache, that is the functions $T_j(x)$ (see for example the cache
locking technique of [10]).

Recent empirical research [11, 9] suggests that jointly solving for the cache partition among the
cores and for the job assignment to cores leads to significant improvements over combining separate
algorithms for the two problems. The papers [11, 9] suggest and test heuristic algorithms for the
joint cache partition and job assignment problem. Our work initiates the theoretic study of this 
problem. 

We study this problem in the context of multi-core caching, but our formulation and results are 
applicable in a more general setting, where the running time of a job depends on the availability 
of some shared resource (cache, CPU, RAM, budget, etc.) that is allocated to the machines. This 
setting is applicable, for example, for users of a public cloud infrastructure like Amazon's Elastic 
Cloud. When a user decides on her public cloud setup, there is usually a limited resource (e.g. 
budget), that can be spent on different machines in the cloud. The more budget is spent on a
machine, the faster it runs jobs, and the user is interested in minimizing the makespan of her set of jobs
while staying within the given budget.

Related Work: Theoretical studies of multi-core caching have shown that traditional online paging
algorithms are not competitive in the multi-core scenario [5, 12]. Both papers [5, 12] show that the
offline decision version of the caching problem is NP-complete, in slightly different models. Much
of the difficulty in designing competitive online algorithms for multi-core caching stems from the
fact that the way in which the request sequences of the different cores interleave is dependent on
the algorithm. An algorithm with good competitive ratio is obtained in [1], when the interleaving
of the request sequences is fixed.

For related work on scheduling see Section 2.

Our results: We present a 36-approximation algorithm for the joint cache partition and job 
assignment problem in Section 3. We obtain this algorithm by showing that it suffices to consider
a subset of polynomial size of the cache partitions. 

We obtain better approximation guarantees for special cases of the joint cache partition and 
job assignment problem. 

When each job has a fixed running time and a minimal cache demand, we present, in Section 4,
a 2-approximation algorithm, a $\frac{3}{2}$-approximation algorithm that uses $2K$ cache and a
$\frac{4}{3}$-approximation algorithm that uses $3K$ cache. We call this problem the single load minimal cache
demand problem. Our $\frac{4}{3}$-approximation algorithm is based on an algorithm presented in Section
4.5 that finds a dominant perfect matching in a threshold graph that has a perfect matching. This
algorithm and the existence of such a matching are of independent interest.

We present in Section 4.6 a polynomial time approximation scheme for a special case of the
single load minimal cache demand problem, in which there is a correlation between the jobs' loads
and cache demands. Such a model is inspired by practical cases where there is an underlying notion
of a job's "hardness" that affects both its load and its cache demand.

We study, in Section 5, the case where the load functions of the jobs, $T_j(x)$, are step functions.
That is, job $j$ takes $l_j$ time to run if given at least $x_j$ cache, and otherwise it takes $h_j > l_j$ time.
For the case where there are a constant number of different $l_j$'s and $h_j$'s we reduce the problem to
the single load minimal cache demand problem and thereby obtain the same approximation results
as for that problem (Section 5).

We define the problem of scheduling on ordered unrelated machines, a natural special case of 
the classical job scheduling problem on unrelated machines. In this problem there is a total order 
on the machines which captures their relative strength. Each job has a different running time on 
each machine and these running times are non-increasing with the strength of the machine. We 
show a reduction from this problem to the joint cache partition and job assignment problem. We 
also give a polynomial time dynamic programming algorithm for a special case of this problem 
that arises when we fix the cache partition in the special case where the number of $l_j$'s and $h_j$'s is
constant (Section 5).

In Section 6 we generalize the joint cache partition and job assignment problem and consider
dynamic cache partitions and dynamic job schedules. We show upper and lower bounds on the 
makespan improvement that can be gained by using dynamic partitions and dynamic assignments. 

2 The ordered unrelated machines problem 

The ordered unrelated machines scheduling problem is defined as follows. There are $c$ machines
and a set $J$ of jobs. The input is a matrix $T(i,j)$ giving the running time of job $j$ on machine $i$,
such that for each two machines $i_1 < i_2$ and any job $j$, $T(i_1,j) \ge T(i_2,j)$. The goal is to assign the
jobs to the machines such that the makespan is minimized.

The ordered unrelated machines scheduling problem is a special case of scheduling on unrelated 
machines in which there is a total order on the machines that captures their relative strengths. This 
special case is natural since in many practical scenarios the machines have some underlying notion 
of strength and jobs run faster on a stronger machine. For example a newer computer typically 
dominates an older one in all parameters, or a more experienced employee does any job faster than 
a new recruit. 

Lenstra et al. [7] gave a 2-approximation algorithm for scheduling on unrelated machines based
on rounding an optimal fractional solution to a linear program, and proved that it is NP-hard to
approximate the problem to within a factor better than $\frac{3}{2}$. Shchepin and Vakhania [TH] improved
Lenstra's rounding technique and obtained a $2 - \frac{1}{c}$ approximation algorithm. It is currently an
open question if there are better approximation algorithms for ordered unrelated machines than
the more general algorithms that approximate unrelated machines.

Another well-studied scheduling problem is scheduling on uniformly related machines. In this 
problem, the time it takes for machine $i$ to run job $j$ is $l_j / s_i$, where $l_j$ is the load of job $j$ and $s_i$ is
the speed of machine $i$. A polynomial time approximation scheme for related machines is described
in [6]. It is easy to see that the problem of scheduling on related machines is a special case of the 
problem of scheduling on ordered unrelated machines, and therefore the ordered unrelated machines 
problem is NP-hard. 

The ordered unrelated machines problem is closely related to the joint cache partition and
job assignment problem. Consider an instance of the joint cache partition and job assignment
problem with $c$ cores, $K$ cache and a set of jobs $J$ such that $T_j(x)$ is the load function of job
$j$. If we fix the cache partition to be some arbitrary partition $p$, and we index the cores in
non-decreasing order of their cache allocation, then we get an instance of the ordered unrelated machines
problem, where $T(i,j) = T_j(p(i))$. Our constant approximation algorithm for the joint cache
partition and job assignment problem, described in Section 3, uses this observation as well as
Lenstra's 2-approximation for unrelated machines. In the rest of this section we prove that the
joint cache partition and job assignment problem is at least as hard as the ordered unrelated
machines scheduling problem.
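
The observation that a fixed cache partition induces an ordered unrelated machines instance can be sketched as follows. This is only an illustration under our own assumptions: the helper names `induced_instance` and `is_ordered` are hypothetical, and rows and columns are 0-based.

```python
from typing import Callable, List, Sequence


def induced_instance(load_fns: Sequence[Callable[[float], float]],
                     partition: Sequence[float]) -> List[List[float]]:
    """Build T with T[i][j] = T_j(p(i)), cores sorted by non-decreasing cache."""
    caches = sorted(partition)
    return [[f(x) for f in load_fns] for x in caches]


def is_ordered(T: List[List[float]]) -> bool:
    """Check T(i1, j) >= T(i2, j) for all i1 < i2 and all jobs j."""
    return all(
        T[i][j] >= T[i + 1][j]
        for i in range(len(T) - 1)
        for j in range(len(T[i]))
    )


# Illustrative non-increasing load functions (not taken from the paper).
load_fns = [lambda x: 10 / (1 + x), lambda x: 6.0 if x < 2 else 3.0]
T = induced_instance(load_fns, [3, 0, 1])
```

Because every $T_j$ is non-increasing, sorting the cores by cache size makes each column of the matrix non-increasing, which is exactly the ordering condition.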

We reduce the ordered unrelated machine problem to the joint cache partition and job assignment
problem. Consider the decision version of the ordered unrelated scheduling problem, with $c$
machines and $n = |J|$ jobs, where job $j$ takes time $T(i,j)$ to run on machine $i$. We want to decide
if it is possible to schedule the jobs on the machines with makespan at most $M$.

Define the following instance of the joint cache partition and job assignment problem. This
instance has $c$ cores, a total cache $K = c(c+1)/2$ and $n' = n + c$ jobs. The first $n$ jobs ($1 \le j \le n$)
correspond to the jobs in the original ordered unrelated machines problem, and $c$ jobs are new jobs
($n + 1 \le j \le n + c$). The load function $T_j(x)$ of job $j$, where $1 \le j \le n$, equals $T(x,j)$ if $x \le c$ and
equals $T(c,j)$ if $x > c$. The load function $T_j(x)$ of job $j$, where $n + 1 \le j \le n + c$, equals $M + \delta$ if
$x \ge j - n$ for some $\delta > 0$ and equals $\infty$ if $x < j - n$. Our load functions $T_j(x)$ are non-increasing
because the original $T(i,j)$'s are non-increasing in the machine index $i$.
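
The construction above can be sketched in Python as follows. This is a hedged illustration, not a definitive implementation: the name `build_reduction` is hypothetical, cache sizes are assumed to be integers, and returning infinity for $x < 1$ is our own assumption since the text leaves $T(0,j)$ undefined.

```python
import math
from typing import Callable, List, Tuple


def build_reduction(T: List[List[float]], M: float, delta: float
                    ) -> Tuple[int, List[Callable[[int], float]]]:
    """Return (K, load_fns) for the instance built from matrix T.

    T[i - 1][j - 1] stores T(i, j); each column is non-increasing in i.
    """
    c, n = len(T), len(T[0])
    K = c * (c + 1) // 2

    def original_job(j: int) -> Callable[[int], float]:
        def load(x: int) -> float:
            if x < 1:
                return math.inf  # assumption: T(0, j) is not defined in the text
            return T[min(x, c) - 1][j]  # T(x, j) if x <= c, else T(c, j)
        return load

    def new_job(k: int) -> Callable[[int], float]:
        # Job n + k: load M + delta with at least k cache, infinite otherwise.
        return lambda x: M + delta if x >= k else math.inf

    load_fns = [original_job(j) for j in range(n)]
    load_fns += [new_job(k) for k in range(1, c + 1)]
    return K, load_fns


# Illustrative instance: 3 machines, 2 jobs, target makespan M = 3.
T = [[4, 6], [3, 5], [1, 2]]
K, fns = build_reduction(T, M=3, delta=1)
```

With the partition $p(i) = i$, the new job $n + i$ fits exactly on core $i$, which is what the proof of the lemma below relies on.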

Lemma 2.1. The makespan of the joint cache partition and job assignment instance defined above
is at most $2M + \delta$ if and only if the makespan of the original unrelated scheduling problem is at
most $M$.

Proof. Assume there is an assignment $S'$ of the jobs in the original ordered unrelated machines
instance of makespan at most $M$. We show a cache partition $p$ and job assignment $S$ for the joint
cache partition and job assignment instance with makespan at most $2M + \delta$.

The cache partition $p$ is defined such that $p(i) = i$ for each core $i$. The partition $p$ uses exactly
$K = c(c+1)/2$ cache. The job assignment $S$ is defined such that for a job $j > n$, $S(j) = j - n$ and
for a job $j \le n$, $S(j) = S'(j)$. The partition $p$ assigns $i$ cache to core $i$, which is exactly enough for
job $n + i$, which is assigned to core $i$ by $S$, to run in time $M + \delta$. It is easy to verify that $p, S$ is a
solution to the joint cache partition and job assignment instance with makespan at most $2M + \delta$.

Assume there is a solution $p, S$ for the joint cache partition and job assignment instance, with
makespan at most $2M + \delta$. Job $j$, such that $n < j \le n + c$, must run on a core with cache at
least $j - n$, or else the makespan would be infinite. Moreover, no two jobs $j_1 > n$ and $j_2 > n$ are
assigned by $S$ to the same core, as this would give a makespan of at least $2M + 2\delta$. Combining these
observations with the fact that the total available cache is $K = c(c+1)/2$, we get that the cache
partition must be $p(i) = i$ for each core $i$. Furthermore, each job $j > n$ is assigned by $S$ to core
$j - n$ and all the other jobs assigned by $S$ to core $j - n$ are jobs corresponding to original jobs in
the ordered unrelated machines instance. Therefore, the total load of original jobs assigned by $S$
to core $i$ is at most $M$.

We define $S'$, a job assignment for the original ordered unrelated machines instance, by setting
$S'(j) = S(j)$ for each $j \le n$. Since $S$ assigns original jobs of total load at most $M$ on each core, it
follows that the makespan of $S'$ is at most $M$. □

The following theorem follows immediately from Lemma 2.1.

Theorem 2.2. There is a polynomial-time reduction from the ordered unrelated machines scheduling
problem to the joint cache partition and job assignment problem.

The reduction in the proof of Lemma 2.1 does not preserve approximation guarantees. However,
by choosing $\delta$ carefully we can get the following result.

Theorem 2.3. Given an algorithm $A$ for the joint cache partition and job assignment problem
that approximates the optimal makespan up to a factor of $1 + \epsilon$, for $0 < \epsilon < 1$, we can construct
an algorithm for the ordered unrelated machines scheduling problem that approximates the optimal
makespan up to a factor of $1 + 2\epsilon + \frac{2\epsilon^2}{1-\epsilon-\chi}$ for any $\chi > 0$.

Proof. We first obtain a $\left(1 + 2\epsilon + \frac{2\epsilon^2}{1-\epsilon-\chi}\right)$-approximation algorithm for the decision version of the
ordered unrelated machines scheduling problem. That is, an algorithm that given a value $M$,
either decides that there is no assignment of makespan $M$ or finds an assignment with makespan
$\left(1 + 2\epsilon + \frac{2\epsilon^2}{1-\epsilon-\chi}\right) M$.

Given an instance of the ordered unrelated machines scheduling problem, we construct an
instance of the joint cache partition and job assignment as described before Lemma 2.1 and set
$\delta = \frac{2\epsilon M}{1-\epsilon-\chi}$ for an arbitrarily small $\chi > 0$. We use algorithm $A$ to solve the resulting instance of
the joint cache partition and job assignment problem. Let $p, S$ be the solution returned by $A$. We
define $S'(j) = S(j)$ for each $1 \le j \le n$. If the makespan of $S'$ is at most $\left(1 + 2\epsilon + \frac{2\epsilon^2}{1-\epsilon-\chi}\right) M$ we
return $S'$ as the solution and otherwise decide that there is no solution with makespan at most $M$.

If the makespan of the original instance is at most $M$, then by Lemma 2.1 there is a solution to
the joint cache partition and job assignment instance resulting from the reduction, with makespan
at most $2M + \delta$. Therefore $p, S$, the solution returned by algorithm $A$, is of makespan at most
$(1 + \epsilon)(2M + \delta)$.

By our choice of $\delta$ we have that $(1 + \epsilon)(2M + \delta) < 2M + 2\delta$ and therefore each core is assigned
by $S$ at most one job $j$, such that $j > n$. In addition, any job $j$ such that $n < j \le n + c$, must
run on a core with cache at least $j - n$, or else the makespan would be infinite. Combining these
observations with the fact that the total available cache is $K = c(c+1)/2$, we get that the cache
partition must be $p(i) = i$ for each core $i$. Furthermore, each job $j > n$ is assigned by $S$ to core
$j - n$ and all the other jobs assigned by $S$ to core $j - n$ are jobs corresponding to original jobs
in the ordered unrelated machines instance. Therefore, the total load of original jobs assigned by
$S$ to core $i$ is at most $(1 + \epsilon)(2M + \delta) - M - \delta$. It follows that the makespan of $S'$ is at most

$$(1 + \epsilon)(2M + \delta) - M - \delta = M\left(1 + 2\epsilon + \frac{2\epsilon^2}{1-\epsilon-\chi}\right).$$

We obtained a $\left(1 + 2\epsilon + \frac{2\epsilon^2}{1-\epsilon-\chi}\right)$-approximation algorithm for the decision version of the ordered
unrelated machines scheduling problem. In order to approximately solve the optimization problem,
we can perform a binary search for the optimal makespan using the approximation algorithm for
the decision version of the problem and get a $\left(1 + 2\epsilon + \frac{2\epsilon^2}{1-\epsilon-\chi}\right)$-approximation algorithm for the
optimization problem. We obtain an initial search range for the binary search by using $\sum_{j=1}^{n} T(c,j)$ as
an upper bound on the makespan of the optimal schedule and $\frac{1}{c}\sum_{j=1}^{n} T(c,j)$ as a lower bound. (See
Section 4.4 for a detailed discussion of a similar case of using an approximate decision algorithm in
a binary search framework to obtain an approximate optimization algorithm.) □
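
The binary search step at the end of the proof can be sketched as below. This is a minimal sketch under our own assumptions: `decide` is a hypothetical approximate decision procedure that returns a solution whenever a schedule of makespan at most its argument exists (and may return `None` otherwise), and `tol` is an assumed stopping tolerance that the proof does not specify.

```python
from typing import Callable, Optional, Tuple, TypeVar

Sol = TypeVar("Sol")


def approx_minimize(decide: Callable[[float], Optional[Sol]],
                    lo: float, hi: float, tol: float) -> Tuple[float, Sol]:
    """Binary search over the target makespan M in the range [lo, hi].

    Invariant: decide succeeded at hi, and decide failed at lo (or lo is the
    initial lower bound), so the optimum is at most hi and at least lo.
    """
    best = decide(hi)
    if best is None:
        raise ValueError("hi must be a valid upper bound on the optimum")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        sol = decide(mid)
        if sol is None:
            lo = mid            # no schedule of makespan mid exists
        else:
            best, hi = sol, mid  # keep the solution found for the smaller target
    return hi, best


# Toy decision procedure (illustrative): the optimal makespan is 7.
def toy_decide(M: float) -> Optional[str]:
    return "schedule" if M >= 7 else None


target, solution = approx_minimize(toy_decide, lo=0.0, hi=16.0, tol=0.5)
```

On termination the returned target exceeds the optimum by at most `tol`, so a decision procedure with factor $\alpha$ yields a solution of makespan at most $\alpha$ times the optimum plus $\alpha \cdot$`tol`.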

3 A constant approximation algorithm



We first obtain an 18-approximation algorithm for the joint cache partition and job assignment
problem that uses $(1 + \frac{5}{2}\epsilon)K$ cache for some constant $0 < \epsilon < \frac{1}{2}$. We then show another algorithm
that uses $K$ cache and approximates the makespan up to a factor of 36.

Our first algorithm, denoted by $A$, enumerates over a subset of cache partitions, denoted by
$P(K, c, \epsilon)$. For each partition in this set $A$ approximates the makespan of the corresponding scheduling
problem, using Lenstra's algorithm, and returns the partition and associated job assignment
with the smallest makespan.

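Let $K' = (1+\epsilon)^{\lceil \log_{1+\epsilon}(K) \rceil}$, the smallest integral power of $(1+\epsilon)$ which is at least $K$. The set
$P(K, c, \epsilon)$ contains cache partitions in which the cache allocated to each core is an integral power
of $(1+\epsilon)$ and the number of different integral powers used by the partition is at most $\log_2(c)$. We
denote by $b$ the number of different cache sizes in a partition. Each core is allocated $\frac{K'}{(1+\epsilon)^{l_j}}$ cache,
where $l_j \in \mathbb{N}$ and $1 \le j \le b$. The smallest possible cache allocated to any core is the smallest
integral power of $(1+\epsilon)$ which is at least $\frac{K\epsilon}{c}$ and the largest possible cache allocated to a core
is $K'$. We denote by $\sigma_j$ the number of cores with cache at least $\frac{K'}{(1+\epsilon)^{l_j}}$. It follows that there are
$(\sigma_j - \sigma_{j-1})$ cores with $\frac{K'}{(1+\epsilon)^{l_j}}$ cache. We require that $\sigma_j$ is an integral power of 2 and that the total
cache used is at most $(1 + \frac{5}{2}\epsilon) K$. Formally,

$$
\begin{aligned}
P(K, c, \epsilon) = \{ (l = \langle l_1, \ldots, l_b \rangle,\ \sigma = \langle \sigma_0, \sigma_1, \ldots, \sigma_b \rangle) \mid\ & b \in \mathbb{N},\ 1 \le b \le \log_2 c & (1) \\
& \forall j,\ l_j \in \mathbb{N},\ 0 \le l_j \le \log_{1+\epsilon}\left(\tfrac{c}{\epsilon}\right) + 1,\ \forall j,\ l_{j+1} > l_j & (2) \\
& \forall j\ \exists u_j \in \mathbb{N} \text{ s.t. } \sigma_j = 2^{u_j},\ \sigma_0 = 0,\ \sigma_b \le c,\ \forall j\ \sigma_{j+1} > \sigma_j & (3) \\
& \sum_{j=1}^{b} (\sigma_j - \sigma_{j-1}) \frac{K'}{(1+\epsilon)^{l_j}} \le \left(1 + \tfrac{5}{2}\epsilon\right) K \} & (4)
\end{aligned}
$$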

When the parameters are clear from the context, we use $P$ to denote $P(K, c, \epsilon)$. Let $M(p, S)$ denote
the makespan of cache partition $p$ and job assignment $S$. The following theorem specifies the main
property of $P$, and is proven in the remainder of this section.

Theorem 3.1. Let $p, S$ be any cache partition and job assignment. There are a cache partition
and a job assignment $\hat{p}, \hat{S}$ such that $\hat{p} \in P$ and $M(\hat{p}, \hat{S}) \le 9M(p, S)$.

An immediate corollary of Theorem 3.1 is that algorithm $A$ described above finds a cache
partition and job assignment with makespan at most 18 times the optimal makespan.
Lemma 3.2 shows that $A$ is a polynomial time algorithm.

Lemma 3.2. The size of P is polynomial in c. 

Proof. Let $(l, \sigma) \in P$. The vector $\sigma$ is a strictly increasing vector of integral powers of 2, where each
power is at most $c$. Therefore the number of possible vectors for $\sigma$ is bounded by the number of
subsets of $\{2^0, \ldots, 2^{\log_2(c)}\}$ which is $O(2^{\log_2 c}) = O(c)$. The vector $l$ is a strictly increasing vector of
integers, each integer is at most $\log_{1+\epsilon}(\frac{c}{\epsilon}) + 1$. Therefore the number of vectors $l$ is bounded by the
number of subsets of integers that are at most $\log_{1+\epsilon}(\frac{c}{\epsilon}) + 1$ which is
$O\left(2^{\log_{1+\epsilon}(c/\epsilon)}\right) = O\left(2^{\frac{\log_2(c/\epsilon)}{\log_2(1+\epsilon)}}\right) = \mathrm{poly}(c)$
since $\epsilon$ is a constant. Therefore $|P| = O\left(c \cdot 2^{\frac{\log_2(c/\epsilon)}{\log_2(1+\epsilon)}}\right)$. □






Let $(p, S)$ be a cache partition and a job assignment that use $c$ cores, $K$ cache and have a
makespan $M(p,S)$. Define a cache partition $p_1$ such that for each core $i$, if $p(i) < \frac{K\epsilon}{c}$ then
$p_1(i) = \frac{K\epsilon}{c}$ and if $p(i) \ge \frac{K\epsilon}{c}$ then $p_1(i) = p(i)$. For each core $i$, $p_1(i) \le p(i) + \frac{K\epsilon}{c}$ and hence the
total amount of cache allocated by $p_1$ is bounded by $(1 + \epsilon)K$. For each core $i$, $p_1(i) \ge p(i)$ and
therefore $M(p_1,S) \le M(p,S)$.

Let $p_2$ be a cache partition such that for each core $i$, $p_2(i) = (1 + \epsilon)^{\lceil \log_{1+\epsilon}(p_1(i)) \rceil}$, the smallest
integral power of $(1 + \epsilon)$ that is at least $p_1(i)$. For each $i$, $p_2(i) \ge p_1(i)$ and thus $M(p_2,S) \le
M(p_1,S) \le M(p,S)$. We increased the total cache allocated by at most a multiplicative factor of
$(1 + \epsilon)$ and therefore the total cache used by $p_2$ is at most $(1 + \epsilon)^2 K \le (1 + \frac{5}{2}\epsilon)K$ since $\epsilon < \frac{1}{2}$.
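
The two rounding steps can be sketched in Python as follows. The function name `round_partition` and the small tolerance used to guard the floating-point logarithm are our own assumptions for illustration; the numbers in the example are not from the paper.

```python
import math
from typing import List, Sequence, Tuple


def round_partition(p: Sequence[float], K: float, eps: float
                    ) -> Tuple[List[float], List[float]]:
    """Return (p1, p2): p1 raises small allocations to K*eps/c,
    p2 rounds every allocation of p1 up to an integral power of (1 + eps)."""
    c = len(p)
    floor_size = K * eps / c
    p1 = [max(x, floor_size) for x in p]
    p2 = []
    for x in p1:
        # Smallest integer k with (1 + eps)^k >= x; the 1e-12 slack is an
        # assumption of this sketch to absorb floating-point error.
        k = math.ceil(math.log(x) / math.log(1 + eps) - 1e-12)
        p2.append((1 + eps) ** k)
    return p1, p2


# Illustrative partition of K = 10 units of cache among c = 4 cores.
K, eps = 10.0, 0.25
p = [6.0, 3.0, 1.0, 0.0]
p1, p2 = round_partition(p, K, eps)
```

In the example the total cache grows from $K$ to at most $(1+\epsilon)K$ after the first step and to at most $(1 + \frac{5}{2}\epsilon)K$ after the second, matching the bounds above.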

Let $\varphi$ be any cache partition that allocates to each core an integral power of $(1 + \epsilon)$ cache. We
define the notion of cache levels. We say that core $i$ is of cache level $l$ in $\varphi$ if $\varphi(i) = \frac{K'}{(1+\epsilon)^l}$. Let
$c_l(\varphi)$ denote the number of cores in cache level $l$ in $\varphi$. The vector of $c_l$'s, which we call the cache
levels vector of $\varphi$, defines the partition $\varphi$ completely since any two partitions that have the same
cache level vector are identical up to a renaming of the cores.

Let $\sigma(\varphi)$ be the vector of prefix sums of the cache levels vector of $\varphi$. Formally, $\sigma_l(\varphi) = \sum_{i=0}^{l} c_i(\varphi)$.
Note that $\sigma_l(\varphi)$ is the number of cores in cache partition $\varphi$ with at least $\frac{K'}{(1+\epsilon)^l}$ cache and that for
each $l$, $\sigma_l(\varphi) \le c$.

For each such cache partition $\varphi$, we define the significant cache levels $l_i(\varphi)$ recursively as follows.
The first significant cache level $l_1(\varphi)$ is the first cache level $l$ such that $c_l(\varphi) > 0$. Assume we already
defined the $i - 1$ first significant cache levels and let $l' = l_{i-1}(\varphi)$ then $l_i(\varphi)$ is the smallest cache
level $l > l'$ such that $\sigma_l(\varphi) > 2\sigma_{l'}(\varphi)$.
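
The recursive definition of significant cache levels can be sketched as a single pass over the cache levels vector. The function name `significant_levels` is hypothetical, and we read the threshold as a strict inequality, which is an assumption about the garbled original; the example vector is illustrative.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple


def significant_levels(c_levels: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Return (significant levels, prefix sums) for a cache levels vector.

    c_levels[l] is the number of cores in cache level l (level 0 holds the
    largest cache).
    """
    sigma = list(accumulate(c_levels))
    sig: List[int] = []
    for l, count in enumerate(c_levels):
        if not sig:
            if count > 0:          # first level that contains a core
                sig.append(l)
        elif sigma[l] > 2 * sigma[sig[-1]]:
            sig.append(l)          # prefix sum more than doubled since l'
    return sig, sigma


levels, sigma = significant_levels([0, 1, 1, 0, 2, 1, 6])
```

In this example the levels strictly between two consecutive significant levels hold few cores relative to the prefix sum at the earlier level, which is the property the next lemma states in general.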

Lemma 3.3. Let $l_j$ and $l_{j+1}$ be two consecutive significant cache levels of $\varphi$, then the total number
of cores in cache levels in between $l_j$ and $l_{j+1}$ is at most $\sigma_{l_j}(\varphi)$. Let $l_b$ be the last significant cache
level of $\varphi$ then the total number of cores in cache levels larger than $l_b$ is at most $\sigma_{l_b}(\varphi)$.

Proof. Assume to the contrary that $\sum_{f=l_j+1}^{l_{j+1}-1} c_f(\varphi) > \sigma_{l_j}(\varphi)$. This implies that for $l' = l_{j+1} - 1$,
$\sigma_{l'}(\varphi) > 2\sigma_{l_j}(\varphi)$ which contradicts the assumption that there are no significant cache levels in
between $l_j$ and $l_{j+1}$ in $\varphi$. The proof of the second part of the lemma is analogous. □

Let $c_l = c_l(p_2)$. For each core $i$, $\frac{K\epsilon}{c} \le p_2(i) \le K'$, so we get that if $l$ is a cache level in $p_2$
such that $c_l \ne 0$ then $0 \le l \le \log_{1+\epsilon}(\frac{c}{\epsilon}) + 1$. Let $\sigma_l = \sigma_l(p_2)$ and $\sigma = \langle \sigma_1, \ldots, \sigma_{b'} \rangle$, where
$b' = \log_{1+\epsilon}(\frac{c}{\epsilon}) + 1$. Let $l_i = l_i(p_2)$, for $1 \le i \le b$, where $b$ is the number of significant cache levels
in $p_2$.

We adjust $p_2$ and $S$ to create a new cache partition $p_3$ and a new job assignment $S_3$. Cache
partition $p_3$ has cores only in the significant cache levels $l_1, \ldots, l_b$ of $p_2$. We obtain $p_3$ from $p_2$ as
follows. Let $f$ be a non-significant cache level in $p_2$. If there is a $j$ such that $l_{j-1} < f < l_j$ then we
take the $c_f$ cores in cache level $f$ in $p_2$ and reduce their cache so they are now in cache level $l_j$ in
$p_3$. If $f > l_b$ then we remove the $c_f$ cores at level $f$ from our solution. It is easy to check that the
significant cache levels of $p_3$ are the same as of $p_2$, that is $l_1, \ldots, l_b$. Since we only reduce the cache
allocated to some cores, the new cache partition $p_3$ uses no more cache than $p_2$ which is at most
$(1 + \frac{5}{2}\epsilon)K$.

We construct $S_3$ by changing the assignment of the jobs assigned by $S$ to cores in non-significant
cache levels in $p_2$. As before, let $f$ be a nonsignificant cache level and let $l_{j-1}$ be the maximal
significant cache level such that $l_{j-1} < f$. For each core $i$ in cache level $f$ in $p_2$ we move all the
jobs assigned by $S$ to core $i$, to a target core in cache level $l_{j-1}$ in $p_3$. Lemma 3.4 specifies the key
property of this job-reassignment.

Lemma 3.4. We can construct $S_3$ such that each core in a significant level of $p_3$ is the target of
the jobs from at most two cores in a nonsignificant level of $p_2$.

Proof. Let $c^3$ denote the cache levels vector of $p_3$ and let $\sigma^3$ denote the vector of prefix sums of $c^3$.
From the definition of $p_3$ it follows that for all $j$, $\sigma^3_{l_j} = \sigma_{l_j}$ and that for $j > 1$,
$c^3_{l_j} = \sigma^3_{l_j} - \sigma^3_{l_{j-1}} = \sigma_{l_j} - \sigma_{l_{j-1}}$.

By Lemma 3.3 the number of cores in nonsignificant levels in $p_2$ whose jobs are reassigned to
one of the $c^3_{l_j}$ cores in level $l_j$ in $p_3$ is at most $\sigma_{l_j}$. So for $j > 1$ the ratio between the number of
cores whose jobs are reassigned to the number of target cores in level $l_j$ in $p_3$ is at most
$\frac{\sigma_{l_j}}{\sigma_{l_j} - \sigma_{l_{j-1}}} = 1 + \frac{\sigma_{l_{j-1}}}{\sigma_{l_j} - \sigma_{l_{j-1}}} < 2$.
For $j = 1$ the number of target cores in level $l_1$ of $p_3$ is $c^3_{l_1} = \sigma_{l_1}$ which is at least
as large as the number of cores at nonsignificant levels between $l_1$ and $l_2$ in $p_2$ so we can reassign
the jobs of a single core of a nonsignificant level between $l_1$ and $l_2$ in $p_2$ to each target core. □

Corollary 3.5. $M(p_3, S_3) \le 3M(p, S)$

Proof. In the new cache partition $p_3$ and job assignment $S_3$ we have added to each core at a
significant level in $p_3$ the jobs from at most 2 other cores at nonsignificant levels in $p_2$. The target
core always has more cache than the original core, thus the added load from each original core is
at most $M(p_2, S)$. It follows that $M(p_3, S_3) \le 3M(p_2, S) \le 3M(p, S)$. □

Let $c^3$ denote the cache levels vector of $p_3$ and let $\sigma^3$ denote the vector of prefix sums of $c^3$.
We now define another cache partition $\hat{p}$ based on $p_3$. Let $u_j = \lfloor \log_2(\sigma^3_{l_j}) \rfloor$. The partition $\hat{p}$ has
$2^{u_1}$ cores in cache level $l_1$, and $2^{u_j} - 2^{u_{j-1}}$ cores in cache level $l_j$ for $1 < j \le b$. The cache levels
$l_1, \ldots, l_b$ are the significant cache levels of $\hat{p}$ and $\hat{p}$ has cores only in its significant cache levels. Let
$\hat{c}_{l_j}$ denote the number of cores in the significant cache level $l_j$ in $\hat{p}$.
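
The rounding of prefix sums down to powers of 2 can be illustrated with a short sketch. The name `round_core_counts` is hypothetical, and the input is assumed to list the prefix sums $\sigma^3_{l_1}, \ldots, \sigma^3_{l_b}$ at the significant levels, each more than twice its predecessor.

```python
from typing import List, Sequence, Tuple


def round_core_counts(sigma3: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Return (hat_c, c3): per-level core counts of the rounded partition
    and of p_3, given the prefix sums of p_3 at its significant levels."""
    # u_j = floor(log2(sigma3_j)); bit_length avoids floating-point logs.
    u = [s.bit_length() - 1 for s in sigma3]
    hat_c = [2 ** u[0]] + [2 ** u[j] - 2 ** u[j - 1] for j in range(1, len(u))]
    c3 = [sigma3[0]] + [sigma3[j] - sigma3[j - 1] for j in range(1, len(sigma3))]
    return hat_c, c3


# Illustrative prefix sums at three significant levels.
hat_c, c3 = round_core_counts([1, 4, 11])
```

In the example each rounded level keeps at least a third of the cores that $p_3$ has at that level, which is the inequality proved next.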

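Lemma 3.6. $3\hat{c}_{l_j} \ge c^3_{l_j}$

Proof. By the definition of $u_j$, we have that $2^{u_j} \le \sigma^3_{l_j} < 2^{u_j+1}$. So for $j > 1$

$$\frac{\hat{c}_{l_j}}{c^3_{l_j}} = \frac{2^{u_j} - 2^{u_{j-1}}}{\sigma^3_{l_j} - \sigma^3_{l_{j-1}}} \ge \frac{2^{u_j} - 2^{u_{j-1}}}{2^{u_j+1} - 2^{u_{j-1}}} = \frac{2^{u_j - u_{j-1}} - 1}{2^{u_j - u_{j-1} + 1} - 1} \qquad (5)$$

Since $l_j$ and $l_{j-1}$ are two consecutive significant cache levels we have that $u_j - u_{j-1} \ge 1$. The ratio
in (5) is an increasing function of $u_j - u_{j-1}$ and thus minimized by $u_j - u_{j-1} = 1$, yielding a lower
bound of $\frac{1}{3}$. For $j = 1$, $\frac{\hat{c}_{l_1}}{c^3_{l_1}} = \frac{2^{u_1}}{\sigma^3_{l_1}} > \frac{2^{u_1}}{2^{u_1+1}} = \frac{1}{2}$. □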



Lemma 3.6 shows that the cache partition $\hat{p}$ has in each cache level $l_j$ at least a third of the
cores that $p_3$ has at level $l_j$. Therefore, there exists a job assignment $\hat{S}$ that assigns to each core
of cache level $l_j$ in $\hat{p}$ the jobs that $S_3$ assigns to at most 3 cores in cache level $l_j$ in $p_3$. We only
moved jobs within the same cache level and thus their load remains the same, and the makespan
$M(\hat{p}, \hat{S}) \le 3M(p_3, S_3) \le 9M(p, S)$.

Lemma 3.7. Cache partition $\hat{p}$ is in the set $P(K, c, \epsilon)$.

Proof. Let $\hat{\sigma}$ be the vector of prefix sums of $\hat{c}$. The vectors $\langle l_1, \ldots, l_b \rangle, \langle \hat{\sigma}_{l_1}, \ldots, \hat{\sigma}_{l_b} \rangle$ clearly
satisfy properties 1-3 in the definition of $P(K, c, \epsilon)$. It remains to show that $\hat{p}$ uses at most $(1 + \frac{5}{2}\epsilon)K$
cache (property 4).

Consider the core with the $x$th largest cache in $\hat{p}$. Let $l_j$ be the cache level of this core. Thus
$\hat{\sigma}_{l_j} \ge x$. Since $\hat{\sigma}_{l_j}$ is the result of rounding down $\sigma^3_{l_j}$ to the nearest integral power of 2, we have
that $\hat{\sigma}_{l_j} \le \sigma^3_{l_j}$. It follows that $\sigma^3_{l_j} \ge x$ and therefore the core with the $x$th largest cache in $p_3$ is in
cache level $l_j$ or smaller and thus it has at least as much cache as the $x$th largest core in $\hat{p}$. So $\hat{p}$
uses at most the same amount of cache as $p_3$ which is at most $(1 + \frac{5}{2}\epsilon)K$. □

This concludes the proof of Theorem 3.1 and establishes that our algorithm $A$ is an
18-approximation algorithm for the problem, using $(1 + \frac{5}{2}\epsilon)K$ cache.

We provide a variation of algorithm $A$ that uses at most $K$ cache, and finds a 36-approximation
for the optimal makespan. Algorithm $B$ enumerates on $r$, $1 \le r \le K$, the amount of cache allocated
to the first core. It then enumerates over the set of partitions $P = P(\frac{K-r}{2}, \lceil \frac{c}{2} \rceil - 1, \frac{2}{5})$. For each
partition in $P$ it adds another core with $r$ cache and applies Lenstra's approximation algorithm on
the resulting instance of the unrelated machines scheduling problem, to assign all the jobs in $J$ to
the cores. Algorithm $B$ returns the partition and assignment with the minimal makespan it
encounters.

Theorem 3.8. If there is a solution of makespan M that uses at most K cache and at most c cores 
then algorithm B returns a solution of makespan 36M that uses at most K cache and at most c 
cores. 

Proof. Let $(p, S)$ be a solution of makespan $M$ that uses $K$ cache and $c$ cores. W.l.o.g. assume that the
cores are indexed according to the non-increasing order of their cache allocation in this solution,
that is $p(i + 1) \le p(i)$.

Let $J' = \{j \in J \mid S(j) \ge 3\}$. Consider the following job assignment $S'$ of the jobs in $J'$ to the
cores of odd indices greater than 1 in $(p, S)$. The assignment $S'$ assigns to core $2i - 1$, for $i \ge 2$, all
the jobs that are assigned by $S$ to cores $2i - 1$ and $2i$. Note that all the jobs assigned by $S'$ to some
core are assigned by $S$ to a core with at most the same amount of cache and thus the makespan of
$S'$ is at most $2M$.

Assume $r = p(1)$. Then $K \ge r + \sum_{\text{odd } i \ge 3} \left(p(i) + p(i-1)\right) \ge r + \sum_{\text{odd } i \ge 3} 2p(i)$ since $p$ is non-increasing.
Therefore we get that $\sum_{\text{odd } i \ge 3} p(i) \le \frac{K-r}{2}$. Therefore we can assign the jobs in $J'$ to $\lceil \frac{c}{2} \rceil - 1$ cores
with a total cache of $\frac{K-r}{2}$ such that the makespan is at most $2M$. By Theorem 3.1 there is a
partition $p' \in P(\frac{K-r}{2}, \lceil \frac{c}{2} \rceil - 1, \frac{2}{5})$ that allocates at most $(1 + \frac{5}{2} \cdot \frac{2}{5})\frac{K-r}{2} = K - r$ cache to $\lceil \frac{c}{2} \rceil - 1$
cores, and a job assignment $S'$ of the jobs in $J'$ to these cores such that the makespan of $p', S'$ is
at most $18M$.

Let $\hat{p}$ be a cache partition that adds to $p'$ another core (called "core 1") with $r$ cache. The total
cache used by $\hat{p}$ is at most $K$. Let $\hat{S}$ be a job assignment such that $\hat{S}(j) = S'(j)$ for $j \in J'$ and
for a job $j \in J \setminus J'$ (a job that was assigned by $S$ either to core 1 or to core 2), $\hat{S}(j) = 1$. Since
the makespan of $(p, S)$ is $M$ we know that the load on core 1 in the solution $\hat{p}, \hat{S}$ is at most $2M$. It
follows that the makespan of $\hat{p}, \hat{S}$ is at most $18M$.

When algorithm $B$ fixes the size of the cache of the first core to be $r = p(1)$, and considers
$p' \in P(\frac{K-r}{2}, \lceil \frac{c}{2} \rceil - 1, \frac{2}{5})$ then it obtains the cache partition $\hat{p}$. We know that $\hat{S}$ is a solution to the
corresponding scheduling problem with makespan at most $18M$. Therefore Lenstra's approximation
algorithm finds an assignment with makespan at most $36M$. □



4 Jobs with a single load and a minimal cache demand 

We consider a special case of the general joint cache partition and job assignment problem where 
each job has a minimal cache demand $x_j$ and single load value $a_j$. Job $j$ must run on a core with
at least $x_j$ cache and it contributes a load of $a_j$ to the core. We want to decide if the jobs can
be assigned to $c$ cores, using $K$ cache, such that the makespan is at most $m$. W.l.o.g. we assume
m = 1. 

In Section 4.1 we describe a 2-approximate decision algorithm that, if the given instance has a
solution of makespan at most 1, returns a solution with makespan at most 2 and otherwise may fail.
In Sections 4.2 and 4.3 we improve the approximation guarantee to $\frac{3}{2}$ and $\frac{4}{3}$ at the expense of using
$2K$ and $3K$ cache, respectively. In Section 4.4 we show how to obtain an approximate optimization
algorithm using an approximate decision algorithm and a standard binary search technique.

4.1 2-approximation 

We present a 2-approximate decision algorithm, denoted by $A_2$. Algorithm $A_2$ sorts the jobs in
non-increasing order of their cache demand. It then assigns the jobs to the cores in this order. It
keeps assigning jobs to a core until the load on the core exceeds 1. Then, $A_2$ starts assigning jobs
to the next core. Note that among the jobs assigned to a specific core the first one is the most
cache demanding, and it determines the cache allocated to this core by $A_2$. Algorithm $A_2$ fails if
the generated solution uses more than $c$ cores or more than $K$ cache. Otherwise, $A_2$ returns the
generated cache partition and job assignment.
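
The greedy procedure just described can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `a2`, the encoding of a job as a pair (cache demand, load), and the return format are assumptions made here.

```python
from typing import List, Optional, Tuple

Job = Tuple[int, float]  # (cache demand x_j, load a_j); this encoding is an assumption


def a2(jobs: List[Job], c: int, K: int) -> Optional[Tuple[List[int], List[List[Job]]]]:
    """Greedy sketch of A_2 for target makespan 1; returns None on failure."""
    cores: List[List[Job]] = []
    load = 0.0
    # Visit jobs from the most to the least cache demanding.
    for job in sorted(jobs, key=lambda j: j[0], reverse=True):
        # Open a new core at the start, or once the current load exceeds 1.
        if not cores or load > 1:
            cores.append([])
            load = 0.0
        cores[-1].append(job)
        load += job[1]
    # The first job placed on a core is its most cache demanding one.
    partition = [core[0][0] for core in cores]
    if len(cores) > c or sum(partition) > K:
        return None
    return partition, cores


jobs = [(4, 0.6), (3, 0.6), (2, 0.5), (1, 0.5)]
print(a2(jobs, c=2, K=7))
```

On this toy instance the sketch fills the first core with the two most demanding jobs (load 1.2) and the second core with the remaining two, so the cores receive 4 and 2 units of cache.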

Theorem 4.1. If there is a cache partition and job assignment of makespan at most 1 that use $c$
cores and $K$ cache, then algorithm $A_2$ finds a cache partition and job assignment of makespan at
most 2 that use at most $c$ cores and at most $K$ cache.

Proof. Let $Y = (p, S)$ be the cache partition and job assignment with makespan 1 whose existence
is assumed by the theorem. $Y$ has makespan 1, so the sum of the loads of all jobs is at most $c$. Since
$A_2$ loads each core, except maybe the last one, with more than 1 load, it follows that $A_2$ uses at
most $c$ cores.

Since $Y$ has makespan 1, the load of each of the jobs is at most 1. Algorithm $A_2$ only exceeds
a load of 1 on a core by the load of the last job assigned to this core, and thus $A_2$ yields a solution
with makespan at most 2.

Assume w.l.o.g. that the cores in $Y$ are indexed such that $p(i+1) \le p(i)$ for any core $i$. Assume
that the cores in $A_2$ are indexed in the order in which they were loaded by $A_2$. By the definition
of $A_2$ the cores are also sorted in non-increasing order of their cache allocation. Denote by $z(i)$
the amount of cache $A_2$ allocates to core $i$. We show that $z(i) \le p(i)$ for all $i \in \{1, \dots, c\}$. This
implies that algorithm $A_2$ uses at most $K$ cache.

$A_2$ allocates to the first core the cache required by the most demanding job, so $z(1) = \max_j x_j$.
This job must be assigned in $Y$ to some core and therefore $z(1) \le p(1)$. Assume to the contrary
that $z(i) > p(i)$ for some $i$. Each job $j$ with cache demand $x_j > p(i)$ must be assigned in $Y$ to one
of the first $(i-1)$ cores, because all the other cores do not have enough cache to run this job. Since
$Y$ has makespan 1 we know that
\[ \sum_{j \mid x_j > p(i)} a_j \le i - 1. \]
Consider all the jobs with cache demand at least $z(i)$. Algorithm $A_2$ failed to assign all these jobs
to the first $(i-1)$ cores, and we know that $A_2$ assigns more than 1 load to each core. So
\[ \sum_{j \mid x_j \ge z(i)} a_j > i - 1. \]
Since $z(i) > p(i)$ and there is a job with cache demand $z(i)$, we have
\[ \sum_{j \mid x_j \ge z(i)} a_j \le \sum_{j \mid x_j > p(i)} a_j, \]
which leads to a contradiction.

Therefore $z(i) \le p(i)$ for all $i$ and algorithm $A_2$ uses at most $K$ cache. □

4.2 3/2-approximation with 2K cache

We define a job to be large if $a_j > \frac{1}{2}$ and small otherwise. Our algorithm $A_{3/2}$ assigns one large
job to each core. Let $s_i$ be the load on core $i$ after the large jobs are assigned, and let $r_i = 1 - s_i$.
We process the small jobs in non-increasing order of their cache demand $x_j$, and assign them to
the cores in non-increasing order of the cores' $r_i$'s. We stop assigning jobs to a core when its load
exceeds 1 and start loading the next core. Algorithm $A_{3/2}$ allocates to each core the cache demand
of its most demanding job. Algorithm $A_{3/2}$ fails if the resulting solution uses more than $c$ cores or
more than $2K$ cache.
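
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: the names `a_three_halves` and `load_of`, the job encoding, and the use of exact fractions for loads are choices made here, and cores that receive no large job are assumed to start with $r_i = 1$.

```python
from fractions import Fraction
from typing import List, Optional, Tuple

Job = Tuple[int, Fraction]  # (cache demand x_j, load a_j); this encoding is an assumption


def load_of(core: List[Job]) -> Fraction:
    """Total load currently assigned to a core."""
    return sum((a for _, a in core), Fraction(0))


def a_three_halves(jobs: List[Job], c: int, K: int) -> Optional[Tuple[List[int], List[List[Job]]]]:
    half = Fraction(1, 2)
    large = [j for j in jobs if j[1] > half]
    small = sorted((j for j in jobs if j[1] <= half), key=lambda j: j[0], reverse=True)
    if len(large) > c:
        return None  # more large jobs than cores
    # One large job per core; the remaining cores start empty (r_i = 1).
    cores = [[j] for j in large] + [[] for _ in range(c - len(large))]
    # Visit cores by non-increasing free space r_i = 1 - s_i.
    cores.sort(key=lambda core: 1 - load_of(core), reverse=True)
    i = 0
    for job in small:
        # Move on once the current core's load exceeds 1.
        while i < c and load_of(cores[i]) > 1:
            i += 1
        if i == c:
            return None  # ran out of cores
        cores[i].append(job)
    # Each core gets the cache demand of its most demanding job.
    partition = [max((x for x, _ in core), default=0) for core in cores]
    if sum(partition) > 2 * K:
        return None
    return partition, cores
```

Exact `Fraction` loads are used only to avoid floating-point noise around the thresholds $\frac{1}{2}$ and 1; the description itself does not depend on that choice.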

Theorem 4.2. If there is a cache partition and job assignment of makespan at most 1 that use $c$
cores and $K$ cache, then $A_{3/2}$ finds a cache partition and job assignment that use at most $2K$ cache,
at most $c$ cores and have a makespan of at most $\frac{3}{2}$.

Proof. Let $Y = (p, S)$ be the cache partition and job assignment with makespan 1 whose existence
is assumed by the theorem. The existence of $Y$ implies that there are at most $c$ large jobs in our
input and that the total volume of all the jobs is at most $c$. Therefore algorithm $A_{3/2}$ uses at most
$c$ cores to assign the large jobs. Furthermore, when $A_{3/2}$ assigns the small jobs it loads each core,
except maybe the last one, with a load of at least 1 and thus uses at most $c$ cores. Algorithm $A_{3/2}$
provides a solution with makespan at most $\frac{3}{2}$ since it can only exceed a load of 1 on any core by
the load of a single small job.

Let $z$ be the cache partition generated by $A_{3/2}$. Let $C_l$ be the set of cores whose most cache
demanding job is a large job and $C_s$ be the set of cores whose most cache demanding job is a
small job. For a core $i \in C_l$, let $j_i$ be the most cache demanding job assigned to core $i$, so we have
$z(i) = x_{j_i}$. The solution $Y = (p, S)$ is a valid solution, thus $x_{j_i} \le p(S(j_i))$ and so $z(i) \le p(S(j_i))$. If
$j_1, j_2$ are two large jobs then $S(j_1) \ne S(j_2)$, and we get that
\[ \sum_{i \in C_l} z(i) \le \sum_{i \in C_l} p(S(j_i)) \le \sum_{i=1}^{c} p(i) = K. \]

In the rest of the proof we index the cores in the solution of $A_{3/2}$ such that $r_1 \ge r_2 \ge \dots \ge r_c$. This
is the same order in which $A_{3/2}$ assigns small jobs to the cores. In $Y$ we assume that the cores are
indexed such that $p(i) \ge p(i+1)$. We now prove that $z(i) \le p(i)$ for any core $i \in C_s$. Assume, to
the contrary, that $z(i) > p(i)$ for some $i$. Let $\alpha$ be the cache demand of the most cache demanding
small job on core $i$ in $Y$. Let $J_1 = \{j \mid a_j \le \frac{1}{2}, x_j \ge z(i)\}$ and let $J_2 = \{j \mid a_j \le \frac{1}{2}, x_j > \alpha\}$. Since
$\alpha \le p(i)$ and by our assumption $p(i) < z(i)$, we get that $\alpha < z(i)$ and therefore $J_1 \subseteq J_2$.

$A_{3/2}$ does not assign all the jobs of $J_1$ to its first $(i-1)$ cores and therefore the total load of the
jobs in $J_1$ is greater than $\sum_{l=1}^{i-1} r_l$. On the other hand we know that in $Y$, assignment $S$ assigns all
the jobs in $J_2$ to its first $i-1$ cores while not exceeding a load of 1. Thus the total load of jobs
in $J_2$ is at most the space available for small jobs on the first $(i-1)$ cores in solution $Y$. Since
$r_1 \ge r_2 \ge \dots \ge r_c$, and since in any solution each core runs at most one large job, we get that $\sum_{l=1}^{i-1} r_l$ is
at least as large as the space available for small jobs in any subset of $(i-1)$ cores in any solution.
It follows that the total load of jobs in $J_2$ is smaller than in $J_1$. This contradicts the fact that
$J_1 \subseteq J_2$.

We conclude that $z(i) \le p(i)$ for every $i \in C_s$. This implies that the total cache allocated to
cores in $C_s$ is at most $K$. We previously showed that the total cache allocated to cores in $C_l$ is at
most $K$, and thus the total cache used by algorithm $A_{3/2}$ is at most $2K$. □

4.3 4/3-approximation with 3K cache, using dominant matching

We present a $\frac{4}{3}$-approximate decision algorithm, $A_{4/3}$, that uses at most $3K$ cache. The main
challenge is assigning the large jobs, which here are defined as jobs of load greater than $\frac{1}{3}$.

There are at most $2c$ large jobs in our instance, because we assume there is a solution of
makespan at most 1 that uses $c$ cores. Algorithm $A_{4/3}$ matches these large jobs into pairs, and
assigns each pair to a different core. In order to perform the matching, we construct a graph $G$
where each vertex represents a large job $j$ of weight $a_j > \frac{1}{3}$. If needed, we add artificial vertices
of weight zero to have a total of exactly $2c$ vertices in the graph. Two vertices have an edge
between them if the sum of their weights is at most 1. The weight of an edge is the sum of the
weights of its endpoints.

A perfect matching in a graph is a subset of edges such that every vertex in the graph is 
incident to exactly one edge in the subset. We note that there is a natural bijection between 
perfect matchings in the graph G and assignments of makespan at most 1 of the large jobs to the 
cores. The c edges in any perfect matching define the assignment of the large jobs to the c cores as 
follows: Let (a, b) be an edge in the perfect matching. If both a and b correspond to large jobs, we 
assign both these jobs to the same core. If a corresponds to a large job and b is an artificial vertex, 
we assign the job corresponding to a to its own core. If both a and b are artificial vertices, we leave 
a core without any large jobs assigned to it. Similarly we can injectively map any assignment of
the large jobs of makespan at most 1 to a perfect matching in G: for each core that has 2 large
jobs assigned to it, we select the edge in G corresponding to these jobs, for each core with a single 
large job assigned to it, we select an edge between the corresponding real vertex and an arbitrary 
artificial vertex, and for each core with no large jobs assigned to it we select an edge in G between 
two artificial vertices. 

A dominant perfect matching in $G$ is a perfect matching $Q$ such that for every $i$, the $i$ heaviest
edges in $Q$ are a maximum weight matching in $G$ of $i$ edges. The graph $G$ is a threshold graph
[13], and in Section 4.5 we provide a polynomial time algorithm that finds a dominant perfect
matching in any threshold graph that has a perfect matching. If there is a solution for the given
instance of makespan at most 1, then the assignment of the large jobs in that solution corresponds
to a perfect matching in $G$, and thus algorithm $A_{4/3}$ can apply the algorithm from Section 4.5 and
find a dominant perfect matching, $Q$, in $G$.

Algorithm $A_{4/3}$ then assigns the small jobs (load $\le \frac{1}{3}$) similarly to algorithms $A_2$ and $A_{3/2}$ described
in Sections 4.1 and 4.2, respectively. It greedily assigns jobs to a core, until the core's load exceeds 1.
Jobs are assigned in non-increasing order of their cache demand, and the algorithm goes through
the cores in non-decreasing order of the sum of loads of the large jobs on each core. Once all
the jobs are assigned, the algorithm allocates cache to the cores according to the cache demand of
the most demanding job on each core. Algorithm $A_{4/3}$ fails if it does not find a dominant perfect
matching in $G$ or if the resulting solution uses more than $c$ cores or more than $3K$ cache.

Theorem 4.3. If there is a solution that assigns the jobs to $c$ cores with makespan 1 and uses $K$
cache, then algorithm $A_{4/3}$ assigns the jobs to $c$ cores with makespan at most $\frac{4}{3}$ and uses at most $3K$
cache.

Proof. Let $Y = (p, S)$ be a solution of makespan at most 1 that uses $c$ cores and $K$ cache.
Algorithm $A_{4/3}$ provides a solution with makespan at most $\frac{4}{3}$ since it may only exceed a load of
1 on any core by the load of a single small job.

Algorithm $A_{4/3}$ uses at most $c$ cores to assign the large jobs because the assignment is based on
a perfect matching of size $c$ in $G$. The existence of $Y$ implies that the total load of all jobs is at
most $c$. When $A_{4/3}$ assigns the small jobs it exceeds a load of 1 on all cores it processes, except
maybe the last one, and therefore we get that $A_{4/3}$ uses at most $c$ cores.

Let $z$ be the cache partition generated by $A_{4/3}$. Let $C_l$ be the set of cores whose most demanding
job is a large job and $C_s$ be the set of cores whose most demanding job is a small job.

Consider any core $i \in C_l$. Let $j$ be the most cache demanding large job assigned to core $i$. Job
$j$ runs in solution $Y$ on some core $S(j)$. Therefore $z(i) = x_j \le p(S(j))$. Since each core in $Y$ runs
at most two large jobs, we get that the total cache allocated by our algorithm to cores in $C_l$ is at
most $2K$.

Consider the large jobs assigned to cores according to the dominant perfect matching $Q$. Denote
by $s_i$ the load on core $i$ after the large jobs are assigned (and before the small jobs are assigned)
and let $r_i = 1 - s_i$. W.l.o.g. we assume the cores in $A_{4/3}$ are indexed such that $r_1 \ge \dots \ge r_c$. For
every $i$, $\sum_{l=i}^{c} s_l$ is at least as large as this sum in any assignment of the large jobs of makespan at
most 1, because any such assignment defines a perfect matching in graph $G$, and if $\sum_{l=i}^{c} s_l$ is larger in
some other assignment then $Q$ is not a dominant perfect matching in $G$. Since the total volume
of all large jobs is fixed, we get that for every core $i$ the amount of free volume on cores 1 till $i$,
$\sum_{l=1}^{i} r_l$, is maximal and cannot be exceeded by any other assignment of the large jobs of makespan
at most 1.

W.l.o.g. we assume that the cores in solution $Y = (p, S)$ are indexed such that $p(i) \ge p(i+1)$.
Let $i$ be any core in $C_s$. We show that $z(i) \le p(i)$. Assume, to the contrary, that $z(i) > p(i)$.
Let $\alpha$ be the cache demand of the most demanding small job assigned to core $i$ in solution $Y$. Let
$J_1 = \{j \mid a_j \le \frac{1}{3}, x_j \ge z(i)\}$ and $J_2 = \{j \mid a_j \le \frac{1}{3}, x_j > \alpha\}$. Since $\alpha \le p(i) < z(i)$, we get that
$J_1 \subseteq J_2$.

Solution $Y$ assigns all the jobs in $J_2$ to its first $(i-1)$ cores, without exceeding a makespan of
1. Therefore the total volume of jobs in $J_2$ is at most the total available space solution $Y$ has on
its first $(i-1)$ cores after assigning the large jobs. Since we know that for every $i$, $\sum_{l=1}^{i} r_l$ is maximal
and cannot be exceeded by any assignment of the large jobs of makespan at most 1, we get that
the total volume of jobs in $J_2$ is at most $\sum_{l=1}^{i-1} r_l$. Algorithm $A_{4/3}$ does not assign all the jobs in $J_1$
to its first $(i-1)$ cores, and since $A_{4/3}$ loads each of the first $(i-1)$ cores with at least 1, we get
that the total volume of jobs in $J_1$ is greater than $\sum_{l=1}^{i-1} r_l$. So we get that the total volume of jobs in
$J_2$ is less than the total volume of jobs in $J_1$, but that is a contradiction to the fact that $J_1 \subseteq J_2$.
Therefore we get that $z(i) \le p(i)$ for every $i \in C_s$. It follows that the total cache allocated by our
algorithm to cores in $C_s$ is at most $K$, and this concludes the proof that our algorithm allocates a
total of at most $3K$ cache to all cores. □



4.4 Approximate optimization algorithms for the single load, minimal cache model

We presented approximation algorithms for the decision version of the joint cache partition and job
assignment problem in the single load and minimal cache demand model. If there is a solution with
makespan $m$, algorithms $A_2$, $A_{3/2}$ and $A_{4/3}$ find a solution of makespan $2m$, $\frac{3}{2}m$ and $\frac{4}{3}m$ that uses $K$,
$2K$ and $3K$ cache, respectively. We now show how to transform these algorithms into approximate
optimization algorithms using a standard binary search technique [7].

Lemma 4.4. Given $m$, $K$ and $c$, assume there is a polynomial time approximate decision algorithm
that, if there is a solution of makespan $m$, $K$ cache and $c$ cores, returns a solution of makespan
$\alpha m$, $\beta K$ cache and $c$ cores, where $\alpha$ and $\beta$ are at least 1. Then there is a polynomial time
approximation algorithm that finds a solution of makespan $\alpha m_{opt}$, $\beta K$ cache and $c$ cores, where
$m_{opt}$ is the makespan of the optimal solution with $K$ cache and $c$ cores.

Proof. Let us temporarily assume that the loads of all jobs are integers. This implies that for any
cache partition and job assignment the makespan is an integer.

Our approximate optimization algorithm performs a binary search for the optimal makespan
and maintains a search range $[L, U]$. Initially, $U = \sum_{j=1}^{n} a_j$ and $L = \lceil \frac{1}{c} U \rceil$. Clearly these initial
values of $L$ and $U$ are a lower and an upper bound on the optimal makespan, respectively. Let
$A$ be the approximate decision algorithm whose existence is assumed in the lemma's statement.
In each iteration, we run algorithm $A$ with parameters $K$, $c$ and $m = \lfloor \frac{L+U}{2} \rfloor$. If $A$ succeeds and
returns a solution with makespan at most $\alpha m$, we update the upper bound $U := m$. If $A$ fails, we
know there is no solution of makespan at most $m$, and we update the lower bound $L := m + 1$. It
is easy to see that the binary search maintains the invariant that after any iteration, if the search
range is $[L, U]$ then $m_{opt} \in [L, \alpha U]$ and we have a solution of makespan at most $\alpha U$. The binary
search stops when $L = U$.

The makespan of the solution when the binary search stops is at most $\alpha U = \alpha L \le \alpha m_{opt}$.
The binary search stops after $O(\log_2(\sum_{j=1}^{n} a_j))$ iterations, and since $A$ runs in polynomial time, we
get that our algorithm runs in polynomial time. This shows that our binary search algorithm is a
polynomial time $\alpha$-approximation algorithm.

If the loads in our instance are not integers, let $\phi$ be the precision in which the loads are given.
By multiplying all loads by $2^{\phi}$ we get an equivalent instance where all the loads of the jobs are
integers. Note that this only adds $\phi$ iterations to the binary search and our algorithm still runs in
polynomial time. □
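
The binary search in this proof can be sketched as follows for integer loads. This is an illustrative sketch under stated assumptions: `approx_optimize` and the callback `decide` are names introduced here, and `a2_decide` is a stand-in decision procedure (the $A_2$ greedy run against a target makespan $m$ instead of 1) used only to make the example self-contained.

```python
from typing import Callable, List, Optional, Tuple

Job = Tuple[int, int]       # (cache demand x_j, integer load a_j); assumed encoding
Solution = List[List[Job]]  # jobs grouped by core


def a2_decide(jobs: List[Job], c: int, K: int, m: int) -> Optional[Solution]:
    """Stand-in decision procedure: the A_2 greedy with target makespan m."""
    cores: Solution = []
    load = 0
    for job in sorted(jobs, key=lambda j: j[0], reverse=True):
        if not cores or load > m:
            cores.append([])
            load = 0
        cores[-1].append(job)
        load += job[1]
    cache = sum(core[0][0] for core in cores)
    return cores if len(cores) <= c and cache <= K else None


def approx_optimize(jobs: List[Job], c: int,
                    decide: Callable[[int], Optional[Solution]]) -> Tuple[int, Solution]:
    """Binary search over integer target makespans in [L, U]."""
    upper = sum(a for _, a in jobs)  # U: every job on a single core
    lower = -(-upper // c)           # L = ceil(U / c)
    best = decide(upper)
    if best is None:
        raise ValueError("decision procedure failed even for m = U")
    while lower < upper:
        m = (lower + upper) // 2
        solution = decide(m)
        if solution is not None:
            upper, best = m, solution  # success: shrink from above
        else:
            lower = m + 1              # failure: no solution of makespan <= m
    return upper, best


jobs = [(4, 6), (3, 6), (2, 5), (1, 5)]
bound, solution = approx_optimize(jobs, 2, lambda m: a2_decide(jobs, 2, 6, m))
```

The returned `bound` is the final value of $U$, and `solution` is the solution the decision procedure produced for it, so its makespan is at most $\alpha$ times `bound` (here $\alpha = 2$).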

The following theorem follows immediately from Lemma 4.4.

Theorem 4.5. Using the approximate decision algorithms presented in this section, we obtain
polynomial time approximate optimization algorithms for the single load, minimal cache demand
problem with approximation factors 2, $\frac{3}{2}$ and $\frac{4}{3}$ that use $K$, $2K$ and $3K$ cache, respectively.

4.5 Dominant perfect matching in threshold graphs 

Let $G = (V, E)$ be an undirected graph with $2c$ vertices where each vertex $x \in V$ has a weight
$w(x) \ge 0$. The edges in the graph are defined by a threshold $t > 0$ to be $E = \{(x, y) \mid w(x) + w(y) \le t, x \ne y\}$.
Such a graph $G$ is known as a threshold graph [13]. We say that the weight of an edge
$(x, y)$ is $w(x, y) = w(x) + w(y)$.

A perfect matching $A$ in $G$ is a subset of the edges such that every vertex in $V$ is incident to
exactly one edge in $A$. Let $A_i$ denote the $i$-th heaviest edge in $A$. We assume, w.l.o.g., that there
is some arbitrary predefined order of the edges in $E$ that is used, as a secondary sort criterion, to
break ties in case several edges have the same weight. In particular, this implies that $A_i$ is uniquely
defined.

Definition 4.6. A perfect matching $A$ dominates a perfect matching $B$ if for every $x \in \{1, \dots, c\}$
\[ \sum_{i=1}^{x} w(A_i) \ge \sum_{i=1}^{x} w(B_i). \]

Definition 4.7. A perfect matching $A$ is a dominant matching if $A$ dominates any other perfect
matching $B$.
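
As a small illustration of Definition 4.6, the prefix-sum comparison can be written directly. The function name `dominates` and the representation of a matching by the list of its edge weights are assumptions made for this sketch.

```python
from itertools import accumulate
from typing import Sequence


def dominates(a_weights: Sequence[float], b_weights: Sequence[float]) -> bool:
    """True if every prefix sum of A's edge weights (heaviest first) is at
    least the corresponding prefix sum of B's edge weights."""
    prefix_a = accumulate(sorted(a_weights, reverse=True))
    prefix_b = accumulate(sorted(b_weights, reverse=True))
    return all(x >= y for x, y in zip(prefix_a, prefix_b))


print(dominates([10, 4], [7, 7]))  # prefix sums 10, 14 against 7, 14
print(dominates([7, 7], [10, 4]))  # the first prefix already fails: 7 < 10
```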

Let $A$ and $B$ be two perfect matchings in $G$. We say that $A$ and $B$ share a prefix of length $l$ if
$A_i = B_i$ for $i \in \{1, \dots, l\}$. The following greedy algorithm finds a dominant perfect matching in a
threshold graph $G$ that has a perfect matching. We start with $G_0 = G$. At step $i$, the algorithm
selects the edge $(x, y)$ with maximum weight in the graph $G_i$. If there are several edges of maximum
weight, then $(x, y)$ is the first by the predefined order on $E$. The graph $G_{i+1}$ is obtained from $G_i$
by removing vertices $x$, $y$ and all edges incident to $x$ or $y$. The algorithm stops when it has selected $c$
edges and $G_c$ is empty.
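
A direct, unoptimized sketch of this greedy algorithm is given below. The function name `greedy_matching`, the representation of vertices by their indices, and the use of lexicographic order on index pairs as the predefined tie-breaking order are assumptions of this sketch.

```python
from itertools import combinations
from typing import List, Optional, Sequence, Tuple


def greedy_matching(weights: Sequence[float], t: float) -> Optional[List[Tuple[int, int]]]:
    """Repeatedly select the heaviest remaining edge (u, v) with
    weights[u] + weights[v] <= t and delete both endpoints."""
    alive = set(range(len(weights)))
    matching: List[Tuple[int, int]] = []
    while alive:
        edges = [(u, v) for u, v in combinations(sorted(alive), 2)
                 if weights[u] + weights[v] <= t]
        if not edges:
            return None  # stuck: the remaining graph has no edge
        # max() keeps the first maximal element, so ties are broken by the
        # lexicographic order in which combinations() lists the pairs.
        u, v = max(edges, key=lambda e: weights[e[0]] + weights[e[1]])
        matching.append((u, v))
        alive -= {u, v}
    return matching


print(greedy_matching([6, 5, 4, 3, 2, 0], t=8))
```

Recomputing the edge list in every step keeps the sketch short at the cost of efficiency; it is meant to mirror the definition of $G_0, G_1, \dots$ rather than to be fast.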

Lemma 4.8. For every $x \in \{0, \dots, c-1\}$, if graph $G_x$ has a perfect matching, then the graph $G_{x+1}$
has a perfect matching.

Proof. Let $M_x$ denote the perfect matching in graph $G_x$. Let $(a, b)$ be the edge of maximum weight
in $G_x$ that we remove, with its vertices and their incident edges, to obtain $G_{x+1}$. If $(a, b) \in M_x$
then clearly $M_x \setminus \{(a, b)\}$ is a perfect matching in $G_{x+1}$. If $(a, b) \notin M_x$, then since $M_x$ is a perfect
matching of $G_x$, there are two vertices $c$ and $d$ such that $(a, c)$ and $(b, d)$ are in $M_x$. The edge $(a, b)$
is the maximum weight edge in $G_x$ and thus $w(b) \ge w(c)$ and $w(a) \ge w(d)$. Therefore $(c, d)$ must
be an edge in $G_x$ because $w(c) + w(d) \le w(a) + w(b) \le t$, where $t$ is the threshold defining the edges in our
threshold graph. Let $M_{x+1} = (M_x \setminus \{(a, c), (b, d)\}) \cup \{(c, d)\}$. It is easy to see that $M_{x+1}$ is a perfect
matching of graph $G_{x+1}$. □

Theorem 4.9. If $G$ is a threshold graph with $2c$ vertices that has a perfect matching, then the
greedy algorithm described above finds a dominant perfect matching.

Proof. Lemma 4.8 implies that our greedy algorithm is able to select a set of $c$ edges that is a
perfect matching in $G$. Denote this matching by $Q$.

Assume, to the contrary, that $Q$ is not a dominant perfect matching in $G$. Let $A$ be a perfect
matching that is not dominated by $Q$, sharing the longest possible prefix with $Q$. Let $x$ denote the
length of the shared prefix of $Q$ and $A$. Let $G_x$ denote the graph obtained from $G$ by removing
the $x$ edges that are the heaviest in both $A$ and $Q$, their vertices and all edges incident to these
vertices.

Let $(a, b) = Q_{x+1}$. Since $A$ and $Q$ share a maximal prefix of length $x$, $A_{x+1} \ne (a, b)$. Since
$(a, b)$ is of maximum weight in $G_x$, it follows that $(a, b) \notin A$ (otherwise, it would have been $A_{x+1}$).
The set of edges $\{A_{x+1}, \dots, A_c\}$ forms a perfect matching of $G_x$, so there must be two edges and two
indices $l_1 > x$ and $l_2 > x$ such that $A_{l_1} = (a, d)$ and $A_{l_2} = (b, c)$. We assume w.l.o.g. that $l_1 < l_2$. The
edge $(a, b)$ is of maximum weight in $G_x$, therefore $w(a) \ge w(c)$ and $w(b) \ge w(d)$. It follows that
$w(c, d) \le w(a, b) \le t$, and therefore $(c, d) \in G_x$. Let $A' = (A \setminus \{(a, d), (b, c)\}) \cup \{(a, b), (c, d)\}$. Clearly,
$A'$ is a perfect matching in $G$ and $A'_{x+1} = (a, b)$, and therefore $A'$ shares a prefix of length $x + 1$ with
$Q$. If $A'$ dominates $A$, then since $Q$ does not dominate $A$, it follows that $Q$ does not dominate $A'$.
Thus $A'$ is a perfect matching that shares a prefix of length $x + 1$ with $Q$ and is not dominated by
$Q$. This is a contradiction to the choice of $A$. We finish the proof by showing that $A'$ dominates $A$.

Let $l_3$ be the index such that $A'_{l_3} = (c, d)$. Since $w(b) \ge w(d)$, $l_3 \ge l_2$. Let
\[ \Delta(l) = \sum_{i=1}^{l} w(A'_i) - \sum_{i=1}^{l} w(A_i). \]
The matchings $A'$ and $A$ share a prefix of length $x$, so for every $1 \le l \le x$, $\Delta(l) = 0$.
For $x + 1 \le l < l_1$, $\Delta(l) = w(a, b) - w(A_l) \ge 0$ since $(a, b)$ is the edge of maximum weight in
$G_x$. For $l_1 \le l < l_2$, $\Delta(l) = w(a, b) - w(a, d) \ge 0$, also by the maximality of $(a, b)$. For $l_2 \le l < l_3$,
$\Delta(l) = w(A'_l) - w(c) - w(d)$, which is non-negative because $l < l_3$ and therefore $w(A'_l) \ge w(A'_{l_3}) =
w(c) + w(d)$. For $l \ge l_3$, $\Delta(l) = 0$. This shows that $A'$ dominates $A$ and concludes our proof that
$Q$ is a dominant perfect matching in $G$. □

4.5.1 On dominant perfect matchings in d-uniform hypergraphs 

The problem of finding a dominant perfect matching in a d-uniform threshold hypergraph^1 that
has a perfect matching is interesting in the context of the single load, minimal cache version of
the joint cache partition and job assignment problem. If we can find such a matching then an
algorithm similar to Algorithm $A_{4/3}$ in Section 4.3 would give a solution that uses $(d + 1)K$ cache
and approximates the makespan up to a factor of $\frac{d+2}{d+1}$.

However, the following example shows that in a 3-uniform threshold hypergraph that has a
perfect matching, a dominant perfect matching does not necessarily exist. Let ε > 0 be an arbitrarily
small constant. Consider a hypergraph with 12 vertices, 3 vertices of each weight in {^, |, | − ε, ε}.
Each triplet of vertices is an edge if the sum of its weights is at most 1. This hypergraph has a
perfect matching. In fact, let us consider two perfect matchings in this hypergraph. Matching A
consists of the edges (|,|,|), (| − ε, | − ε, ε), (| − ε, |,|) and (|, ε, ε). Matching B consists of three
edges of the form (^, g, | − ε) and one edge of the form (ε, ε, ε). It is easy to check that A and B
are valid perfect matchings in this hypergraph. Any dominant perfect matching in this hypergraph
must contain the edge (|, |, |) in order to dominate A, since this is the only edge of weight 1 in
this hypergraph. The sum of the two heaviest edges in matching B is 2 − 2ε and therefore any
dominant perfect matching must have an edge of weight at least 1 − 2ε, as otherwise the matching
will not dominate matching B. But, if the edge (|, |, |) is in the dominant matching, then all
edges disjoint from (|, |, |) have a weight smaller than 1 − 2ε. Thus no dominant perfect matching
exists in this hypergraph.

^1 A d-uniform threshold hypergraph is defined on a set of vertices, $V$, each with a non-negative weight $w(v)$. The
set of edges, $E$, contains all the subsets $S \subseteq V$ of size $d$ such that the sum of the weights of the vertices in $S$ is at
most some fixed threshold $t > 0$.

Matching A in the example above is the perfect matching found by applying the greedy algorithm
to this hypergraph. It is interesting to note that in a 3-uniform threshold hypergraph, the greedy
algorithm does not necessarily find a perfect matching at all. This is because Lemma 4.8 does not
extend to 3-uniform threshold hypergraphs. Let ε > 0 be an arbitrarily small constant. Consider
a hypergraph with 9 vertices, 3 vertices of each weight in |, | − ε}. Each triplet of vertices is
an edge if the sum of its weights is at most 1. This hypergraph has a perfect matching since the
3 edges of the form |, | − ε) are a perfect matching in this hypergraph. However the greedy
algorithm first selects the edge (|,|,|) and then selects an edge of the form (|,|,^^). The
remaining hypergraph now contains three vertices and no edges, so the greedy algorithm is stuck
and fails to find a perfect matching.
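
The failure of the greedy algorithm on 3-uniform threshold hypergraphs can be checked by a small experiment. The sketch below is illustrative only: the weights $\frac{1}{3}$, $\frac{1}{6}$ and $\frac{1}{2} - \varepsilon$ with $\varepsilon = \frac{1}{100}$ and threshold 1 are an assumption chosen here so that the behaviour described above occurs, and the function name `greedy_hypermatching` is hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from typing import List, Sequence, Tuple


def greedy_hypermatching(weights: Sequence[Fraction], t: Fraction,
                         d: int = 3) -> Tuple[List[Tuple[int, ...]], List[int]]:
    """Greedy on a d-uniform threshold hypergraph.
    Returns the selected edges and the vertices left unmatched."""
    alive = list(range(len(weights)))
    matching: List[Tuple[int, ...]] = []
    while alive:
        edges = [e for e in combinations(alive, d)
                 if sum(weights[v] for v in e) <= t]
        if not edges:
            break  # stuck: no edge among the remaining vertices
        best = max(edges, key=lambda e: sum(weights[v] for v in e))
        matching.append(best)
        alive = [v for v in alive if v not in best]
    return matching, alive


eps = Fraction(1, 100)  # assumed value of epsilon
# Assumed weights: vertices 0-2 weigh 1/3, 3-5 weigh 1/6, 6-8 weigh 1/2 - eps.
weights = [Fraction(1, 3)] * 3 + [Fraction(1, 6)] * 3 + [Fraction(1, 2) - eps] * 3
matching, left = greedy_hypermatching(weights, Fraction(1))
print(matching, left)

# A perfect matching does exist: one vertex of each weight per edge.
perfect = [(0, 3, 6), (1, 4, 7), (2, 5, 8)]
print(all(sum(weights[v] for v in e) <= 1 for e in perfect))
```

With these assumed weights the greedy procedure selects two edges and then leaves three vertices whose total weight exceeds 1, even though the three edges in `perfect` form a perfect matching.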

4.6 PTAS for jobs with correlative single load and minimal cache demand 

The main result in this section is a polynomial time approximation scheme for instances of the single 
load minimal cache demand problem, where there is a correlation between the load and the cache 
demand of jobs with non-zero cache demand. This special case is motivated by the observation 
that often there is some underlying notion of a job's "hardness" that affects both its load and its 
minimal cache demand. 

Consider an instance of the single load minimal cache demand problem such that for any two
jobs $j$ and $j'$ such that $x_j$ and $x_{j'}$ are non-zero, $a_j \le a_{j'} \iff x_j \le x_{j'}$. We call a job $j$ such that
$x_j > 0$ a demanding job and a job $j$ such that $x_j = 0$ a non-demanding job. We consider the
following decision problem: we want to decide whether there is a cache partition of $K$ cache to $c$ cores
and an assignment of jobs to the cores such that each job's minimal cache demand is satisfied and
the resulting makespan is at most $m$. By scaling down the loads of the jobs by $m$, we assume
w.l.o.g. that $m = 1$.

Let $\varepsilon > 0$. We present an algorithm that, if there is a cache partition and a job assignment
with makespan at most 1, returns a cache partition and a job assignment with makespan at most
$(1 + 2\varepsilon)$. Otherwise, our algorithm either decides that there is no solution of makespan at most 1
or returns a solution of makespan at most $(1 + 2\varepsilon)$. Combining this algorithm with a binary search,
we obtain a PTAS.

If there is a job $j$ such that $a_j > 1$ then our algorithm decides that there is no solution of
makespan at most 1. Thus we assume that $a_j \le 1$ for any $j$.

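Let $J = J_1 \cup J_2$, where $J_1 = \{j \in J \mid a_j > \varepsilon\}$ and $J_2 = J \setminus J_1$. In the first phase, we deal only with jobs in
$J_1$. For each $j \in J_1$ let $u_j = \max\{u \in \mathbb{N} \mid \varepsilon + u\varepsilon^2 \le a_j\}$. We say that $\varepsilon + u_j\varepsilon^2$ is the rounded-down
load of job $j$.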
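
As a quick illustration of the rounding step, $u_j$ can be computed in closed form as $\lfloor (a_j - \varepsilon)/\varepsilon^2 \rfloor$. The helper name `rounded_down` and the use of exact fractions are assumptions of this sketch.

```python
from fractions import Fraction
from math import floor
from typing import Tuple


def rounded_down(a: Fraction, eps: Fraction) -> Tuple[int, Fraction]:
    """Return (u, eps + u * eps**2) for the largest integer u >= 0
    with eps + u * eps**2 <= a. Assumes a >= eps."""
    u = floor((a - eps) / eps ** 2)
    return u, eps + u * eps ** 2


u, load = rounded_down(Fraction(3, 5), Fraction(1, 4))
print(u, load)
```

For $a_j = \frac{3}{5}$ and $\varepsilon = \frac{1}{4}$ this gives $u_j = 5$ and a rounded-down load of $\frac{9}{16}$. By the maximality of $u_j$, the actual load exceeds the rounded-down load by less than $\varepsilon^2$.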

Let $U_d = \{u_j \mid j \in J_1, x_j > 0\}$ and $U_{nd} = \{u_j \mid j \in J_1, x_j = 0\}$. An assignment pattern
of a core is a table that indicates, for each $u \in U_d$, how many demanding jobs of rounded-down
load $\varepsilon + u\varepsilon^2$ are assigned to the core and, for each $u \in U_{nd}$, how many non-demanding jobs of
rounded-down load $\varepsilon + u\varepsilon^2$ are assigned to the core. Note that an assignment pattern of a core
does not identify the actual jobs assigned to the core. We only consider assignment patterns whose
rounded-down load is at most 1.

A configuration of cores is a table indicating how many cores we have of each possible assignment
pattern. A configuration of cores $T$ is valid if for every $u \in U_d$, the number of demanding jobs in
$J_1$ whose $u_j = u$ equals the sum of the numbers of demanding jobs with $u_j = u$ in all assignment
patterns in $T$ and, similarly, for every $u \in U_{nd}$, the number of non-demanding jobs in $J_1$ whose
$u_j = u$ equals the sum of the numbers of non-demanding jobs with $u_j = u$ in all assignment patterns
in $T$.
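
To make the notion of an assignment pattern concrete, the following sketch enumerates all patterns whose rounded-down load is at most 1 for a given list of load classes. The encoding of a class as a pair (kind, $u$), with kind `"d"` for demanding and `"nd"` for non-demanding jobs, and the function name `assignment_patterns` are assumptions made for illustration.

```python
from fractions import Fraction
from typing import Iterator, List, Tuple

LoadClass = Tuple[str, int]  # ("d" or "nd", u); this encoding is an assumption


def assignment_patterns(classes: List[LoadClass], eps: Fraction) -> List[Tuple[int, ...]]:
    """All tuples of per-class job counts whose total rounded-down
    load, with eps + u * eps**2 per job, is at most 1."""
    def extend(i: int, room: Fraction) -> Iterator[Tuple[int, ...]]:
        if i == len(classes):
            yield ()
            return
        load = eps + classes[i][1] * eps ** 2
        count = 0
        # Try every count of class i that still fits in the remaining room.
        while count * load <= room:
            for rest in extend(i + 1, room - count * load):
                yield (count,) + rest
            count += 1
    return list(extend(0, Fraction(1)))


classes = [("d", 0), ("d", 2), ("nd", 0)]
patterns = assignment_patterns(classes, Fraction(1, 2))
print(len(patterns), patterns)
```

A configuration of cores would then assign a multiplicity to each of these patterns; enumerating configurations is omitted here to keep the sketch short.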

The outline of our algorithm is as follows. The algorithm enumerates over all valid configurations
of cores. For each valid configuration $T$, we find an actual assignment of the jobs in $J_1$ that matches
$T$ and minimizes the total cache used. We then proceed to assign the jobs in $J_2$, in a way that
guarantees that if there is a solution of makespan 1 and $K$ cache that matches this configuration
of cores, then we obtain a solution of makespan at most $(1 + 2\varepsilon)$ and at most $K$ cache. If our
algorithm does not generate a solution of makespan at most $(1 + 2\varepsilon)$ and at most $K$ cache for any
valid configuration of cores, then our algorithm decides that no solution of makespan at most 1
exists.

Let $T$ be a valid configuration of cores. For each core $i \in \{1, \dots, c\}$, let $q_i$ be the maximal
rounded-down load of a demanding job assigned to core $i$ according to the assignment pattern of
core $i$ in $T$. Let $\alpha_i$ be the number of demanding jobs of rounded-down load $q_i$ on core $i$, according to
$T$. We assume w.l.o.g. that the cores are indexed such that $q_i \ge q_{i+1}$. Let $Q = \{q_i \mid i \in \{1, \dots, c\}\}$.
For each $q \in Q$, let $s(q)$ be the index of the first core $i$ with $q_i = q$ and let $e(q)$ be the index of the last
core $i$ with $q_i = q$. Assume that the cores $s(q), \dots, e(q)$ are indexed such that $\alpha_{s(q)} \ge \dots \ge \alpha_{e(q)}$.
Let $J_1(q) = \{j \in J_1 \mid x_j > 0, \varepsilon + u_j\varepsilon^2 = q\}$ be the set of all demanding jobs in $J_1$ whose rounded-down
load is $q$. Let $Y(q)$ be the set of the $\sum_{i=s(q)}^{e(q)} \alpha_i$ jobs of smallest cache demands in $J_1(q)$.

Our algorithm builds an assignment matching $T$ of minimal cache usage among all assignments
matching $T$. To do so, our algorithm goes over $Q$ in decreasing order and distributes the jobs
in $Y(q)$ to the cores $s(q), \dots, e(q)$ in this order of the cores, such that core $i \in [s(q), e(q)]$, in
turn, gets the $\alpha_i$ most cache demanding jobs in $Y(q)$ that are not yet assigned. After we assign
the demanding jobs with the maximal rounded-down load on each core, our algorithm arbitrarily
chooses the identity of all other jobs in the configuration $T$. These are non-demanding jobs and
demanding jobs whose rounded-down load is not the maximal rounded-down load on their core.
Each core is allocated cache according to the cache demand of the most cache demanding job that
is assigned to it.

The algorithm continues with the jobs in $J_2$. It first assigns the demanding jobs in $J_2$, in the
following greedy manner. Order these jobs from the most cache demanding to the least cache
demanding. For each core, we consider two load values: its actual load, which is the sum of the
actual loads of jobs in $J_1$ assigned to the core, and its rounded-down load, which is the sum of
rounded-down loads of jobs in $J_1$ assigned to the core. We order the cores such that first we have
all the cores that already had some cache allocated to them in the previous phase of the algorithm,
in an arbitrary order. Following these cores, we order the cores with no cache allocated to them,
from the least loaded core to the most loaded core, according to their rounded-down loads. These
cores are either empty or have only non-demanding jobs, from $J_1$, assigned to them. The algorithm
assigns the jobs to the cores in these orders (of the jobs and of the cores) and stops adding more
jobs to a core and moves to the next one when the core's actual load exceeds $1 + \varepsilon$. After all
these jobs are assigned, the algorithm adjusts the cache allocation of the cores whose most cache
demanding job is now a job of $J_2$.

Finally, it assigns the non-demanding jobs in $J_2$. Each such job is assigned arbitrarily to a core
whose actual load does not already exceed $1 + \varepsilon$.

Lemma 4.10. The number of valid configurations of cores is $O(c^{O(1)})$.

Proof. We first consider the number of assignment patterns with rounded-down load at most 1.
Since $a_j \le 1$ for each job $j$, the size of $U_d$ and the size of $U_{nd}$ are at most $\lfloor \frac{1}{\varepsilon^2} \rfloor = O(\frac{1}{\varepsilon^2}) = O(1)$.
In an assignment pattern of load at most 1, there are at most $\frac{1}{\varepsilon}$ jobs in $J_1$ assigned to each core,
and thus we get that the number of possible assignment patterns is at most $(\frac{1}{\varepsilon})^{O(1/\varepsilon^2)} = O(1)$.
Since the number of assignment patterns we consider is $O(1)$, it follows that the number of possible
configurations of cores is $O(c^{O(1)})$. □

Since our algorithm spends polynomial time per configuration of cores, Lemma 4.10
implies that our algorithm runs in polynomial time.

Lemma 4.11. For any configuration of cores $T$ there is an assignment matching $T$ of minimal
cache usage among all assignments matching $T$ that, for each $q \in Q$, assigns the $\sum_{i=s(q)}^{e(q)} \alpha_i$ least cache
demanding jobs in $J_1(q)$ (i.e. the set of jobs $Y(q)$) to the cores $s(q), \dots, e(q)$.

Proof. Consider a job assignment S of minimal cache usage that matches T. Assume that for some
$q \in Q$ assignment S does not assign all the jobs in $Y(q)$ to the cores $s(q), \ldots, e(q)$. So there is a
core $i \in [s(q), e(q)]$ that runs a job $j \in J_1(q) \setminus Y(q)$.

Since S assigns $\sum_{i=s(q)}^{e(q)} a_i$ jobs from $J_1(q)$ to cores $s(q), \ldots, e(q)$ and since jobs in $J_1(q)$ cannot
be assigned to cores $i' > e(q)$, it follows that there is a core $i' < s(q)$ and a job $j' \in Y(q)$ such that
$S(j') = i'$. Suppose we switch the assignment of jobs j and j' and run job j on core i' and job j' on
core i. Let S' denote the resulting assignment. The cache required by core i' does not increase, as
it runs demanding jobs of rounded down load greater than q and therefore of cache demand greater
than the cache demand of job j. By the choice of the jobs j and j' we know that $x_{j'} \le x_j$ and
therefore the cache required by core i in S' can only decrease compared to the cache required by
core i in S. It follows that the cache usage of S' is at most that of S and since S is of minimal
cache usage among all assignments that match T, we get that the cache usage of S' must be the same
as that of S.

By repeating this argument as long as there is a job that violates Lemma 4.11 we obtain an
assignment as required. □

Lemma 4.12. For any configuration of cores T, let S be an assignment matching T such that
for each $q \in Q$ and for each core $i \in [s(q), e(q)]$, if we index the jobs in $Y(q)$ from the most
cache demanding to the least cache demanding, assignment S assigns to core i the jobs in $Y(q)$ of
indices $\sum_{j=s(q)}^{i-1} a_j + 1, \ldots, \sum_{j=s(q)}^{i} a_j$. Then assignment S is of minimal cache usage among all
assignments matching T.

Proof. Assume to the contrary that assignment S is not of minimal cache usage among all
assignments matching T. Let S' be an assignment whose existence is guaranteed by Lemma 4.11. Since S
and S' have different cache usages, there exists $q \in Q$ such that S and S' differ on their assignment
of the jobs in $Y(q)$. We index the jobs in $Y(q)$ from the most cache demanding to the least cache
demanding. Let $j \in Y(q)$ be the first job (most cache demanding) in $Y(q)$ such that $S(j) \ne S'(j)$.
We select S' such that it maximizes j among all assignments satisfying Lemma 4.11 that disagree
with S on the assignment of the jobs in $Y(q)$.

Denote $i = S(j)$ and $i' = S'(j)$. Since S and S' both assign $a_i$ jobs from $Y(q)$ to core i and
since j is the first job in $Y(q)$ on which S and S' disagree, there is a job $j_2 \in Y(q)$, $j_2 > j$,
such that $S'(j_2) = i$.

We first assume that there is a job $j_1 < j$ such that $S(j_1) = i$. Let S'' be the assignment such
that $S''(j) = i$, $S''(j_2) = i'$ and for any job $h \notin \{j, j_2\}$, $S''(h) = S'(h)$. The cache required by core
i' in S'' is at most the cache required by core i' in S', since $j < j_2$. Since $j_1 < j$ and $S(j_1) = i$,
we know that $S'(j_1) = i$ and also $S''(j_1) = i$. This implies that in S'', core i requires the same
amount of cache as in S'. It follows that S'' is also an assignment of minimal cache usage, and
that it satisfies Lemma 4.11. Since $S''(j) = S(j)$, we get a contradiction to the way we selected S'.
Thus S is of minimal cache usage among all assignments matching T.

We now assume that j is the first job in $Y(q)$ such that $S(j) = i$. Let S'' be the following
assignment. Any job that is assigned by S' to a core different from i and i' is assigned by S'' to the
same core. For any job x such that $S'(x) = i'$, $S''(x) = i$. All the $a_{i'}$ least cache demanding jobs
assigned by S' to core i are assigned by S'' to core i'. Note that $a_i \ge a_{i'}$ and therefore assignment
S'' is well defined.

Since S and S' agree on the assignment of the jobs in $Y(q)$ that precede j and assign them to cores $l < i$,
job j is the most cache demanding job assigned to cores $l \ge i$ by S' and S''. Therefore in
assignment S', core i' requires $x_j$ cache and in assignment S'' core i requires $x_j$ cache. In assignment
S'', core i' is assigned a set of jobs that is a subset of the jobs assigned to core i by S'. Thus the
cache required by core i' in assignment S'' is at most the cache required by core i in assignment
S'. It follows that S'' is also an assignment of minimal cache usage, and that it satisfies Lemma
4.11. This contradicts the choice of S' and concludes the proof that assignment S is of minimal
cache usage among all assignments matching T. □

Corollary 4.13. For each configuration of cores T our algorithm builds an actual assignment of
minimal cache usage of the jobs in $J_1$ that matches T.

Proof. The assignment returned by our algorithm is an assignment S, as in the statement of Lemma
4.12. □

Lemma 4.14. Consider an instance of the correlative single load minimal cache demand problem. 
If there is a cache partition and job assignment that schedules the jobs on c cores, uses at most 
K cache and has a makespan of at most 1 then our algorithm finds a cache partition and job 
assignment that schedules the jobs on c cores, uses at most K cache and has a makespan of at most 
$(1 + 2\varepsilon)$.

Proof. Let A be a solution of makespan at most 1 with c cores and K cache, whose existence is
assumed by the lemma. Let $T_A$ be the configuration of the cores corresponding to the assignment
of the jobs in $J_1$ by solution A and assume our algorithm currently considers $T_A$ in its enumeration.

We show that our algorithm succeeds in assigning all the jobs to c cores. Assume to the
contrary that it fails. It can only fail if all cores are assigned an actual load of more than $(1 + \varepsilon)$
and there are still remaining jobs to assign. This indicates that the total volume to assign is larger
than $c(1 + \varepsilon)$, which contradicts the fact that assignment A is able to assign the jobs to c cores
with makespan at most 1.

Let S denote the assignment of all jobs on c cores that our algorithm returns when it considers
$T_A$. We know that S matches $T_A$ for jobs in $J_1$. We now show that in S each core has an actual
load of at most $1 + 2\varepsilon$. When we restrict S to $J_1$ we know that the rounded down load on each core
is at most 1 and that each core has at most $1/\varepsilon$ jobs from $J_1$ assigned to it. Since the actual load of
any job in $J_1$ is at most $\varepsilon^2$ larger than its rounded down load, we get that if we restrict assignment
S to $J_1$, the actual load on each core is at most $1 + \frac{\varepsilon^2}{\varepsilon} = 1 + \varepsilon$. The way our algorithm assigns the
jobs in $J_2$ implies that the actual load of a core in assignment S can only exceed $1 + \varepsilon$ by the load
of a single job from $J_2$. Therefore the actual load on any core in assignment S is at most $1 + 2\varepsilon$.

We show that assignment S uses at most K cache. Cache is allocated by our algorithm in
two steps: when it decides on the actual assignment of the jobs in $J_1$ that matches $T_A$ and when
it assigns the demanding jobs in $J_2$. Corollary 4.13 shows that S restricted to $J_1$ is of minimal
cache usage among all assignments matching $T_A$ and thus uses at most the same amount of cache as
assignment A restricted to $J_1$.

We show that when we also take into account the demanding jobs in $J_2$, S uses at most the
same amount of cache as A. Assume the cores in S are indexed according to the order in which
our algorithm assigns demanding jobs from $J_2$ to them. Assume the cores in A are indexed such
that core i in S and core i in A have the same assignment pattern. For any core in S, we say that
its free space is $(1 + \varepsilon)$ minus the sum of the actual loads of all jobs in $J_1$ assigned to it by S. For
any core in A, we say that its free space is 1 minus the sum of the actual loads of all jobs in $J_1$
assigned to it by A. For any i, core i in S has the same rounded down load as core i in A and the
actual load of core i in S is at most $\varepsilon$ larger than the actual load of core i in A. Therefore, by the
definition of free space, the free space of core i in solution S is at least the free space of core i in
solution A.

Let $i_2$ be the number of cores in S that have a demanding job from $J_1$ assigned to them. When
our algorithm assigns jobs in $J_2$ to a core $i \le i_2$, it does not increase the cache required by core i
since any job in $J_1$ is at least as cache demanding as any job in $J_2$. It follows that the total cache
required by cores $1, \ldots, i_2$ in S is at most the total cache required by cores $1, \ldots, i_2$ in A.

Let $i > i_2$ be a core in S whose cache demand is determined by a job from $J_2$. We now show
that core i in S requires no more cache than core i in A. This will conclude the proof that S uses
at most K cache.

The total load of demanding jobs in $J_2$ that S assigns to cores $1, \ldots, i-1$ is at least the sum
of the free space of these cores, since our algorithm exceeds an actual load of $1 + \varepsilon$ on each core
before moving to the next. The sum of the free space of cores $1, \ldots, i-1$ in S is at least the sum of
the free space of the cores $1, \ldots, i-1$ in A, which in turn is an upper bound on the total load of
demanding jobs from $J_2$ that are assigned in A to cores $1, \ldots, i-1$. Since our algorithm assigns
the demanding jobs in $J_2$ in a non-increasing order of their cache demand we get that the cache
demand of the most cache demanding job from $J_2$ on core i in S is at most the cache demand of
the most cache demanding job in $J_2$ on core i in A. □

Lemma 4.14 shows that for any $\varepsilon' > 0$, we have a polynomial time $(1 + 2\varepsilon')$-approximate
decision algorithm. Given $\varepsilon > 0$, by applying our algorithm with $\varepsilon' = \varepsilon/2$ we obtain a polynomial
time $(1 + \varepsilon)$-approximate decision algorithm.

By using a binary search similar to the one in Lemma 4.4 we obtain a $(1 + \varepsilon)$-approximation
for the optimization problem, using our $(1 + \varepsilon)$-approximate decision algorithm. To conclude, we
have proven the following theorem.

Theorem 4.15. There is a polynomial time approximation scheme for the joint cache partition and 
job assignment problem, when the jobs have a correlative single load and minimal cache demand. 

5 Step functions with a constant number of load types 

Empirical studies [3] suggest that the load of a job, as a function of available cache, is often
similar to a step-function. The load of the job drops at a few places when the cache size exceeds the 
working-set required by some critical part. In between these critical cache sizes the load of the job 
decreases negligibly with additional cache. The problems we consider in this section are motivated 
by this observation. 

Formally, each job $j \in J$ is described by two load values $l_j \le h_j$ and a cache demand $x_j \in
\{0, \ldots, K\}$. If job j is running on a core with at least $x_j$ cache then it takes $l_j$ time and otherwise
it takes $h_j$ time. If a job is assigned to a core that meets its cache demand, $x_j$, we say that it is
assigned as a small job. If it is assigned to a core that doesn't meet its cache demand we say that
it is assigned as a large job. At first we study the case where the number of different load types is
constant and then we show a polynomial time scheduling algorithm for the corresponding special
case of the ordered unrelated machines scheduling problem.
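The step-function model above can be made concrete with a small sketch. The class name `StepJob`, its field names and the helper `makespan` are illustrative choices rather than notation from the paper; the load rule itself is exactly the one just defined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepJob:
    low: int     # l_j: running time when the cache demand is met
    high: int    # h_j: running time otherwise (low <= high)
    demand: int  # x_j: cache demand

    def load(self, cache: int) -> int:
        """Small load if the core has at least x_j cache, large load otherwise."""
        return self.low if cache >= self.demand else self.high


def makespan(jobs, partition, assignment):
    """Makespan of a static solution.

    partition[i]  : cache given to core i
    assignment[k] : core on which jobs[k] runs
    """
    loads = [0] * len(partition)
    for job, core in zip(jobs, assignment):
        loads[core] += job.load(partition[core])
    return max(loads)


jobs = [StepJob(low=2, high=5, demand=3), StepJob(low=1, high=4, demand=1)]
# Core 0 has 3 units of cache (first job is small), core 1 has none (second job is large).
print(makespan(jobs, partition=[3, 0], assignment=[0, 1]))
```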

Let $L = \{l_j \mid j \in J\}$ and $H = \{h_j \mid j \in J\}$ be the sets of small and large loads, respectively. Here
we assume that $|L|$ and $|H|$ are both bounded by a constant.

For each $\alpha \in L$, $\beta \in H$, we say that job j is of small type $\alpha$ if $l_j = \alpha$ and we say that job j
is of large type $\beta$ if $h_j = \beta$. If job j is of small type $\alpha$ and large type $\beta$ we say that it is of load
type $(\alpha, \beta)$. Note that jobs $j_1, j_2$ of the same load type may have different cache demands $x_{j_1} \ne x_{j_2}$
and thus if we take cache demands into account the number of different job types is $\Omega(K)$ and not
$O(1)$.

We reduce this problem to the single load minimal cache demand problem studied in Section
4. For each load type $(\alpha, \beta)$, we enumerate on the number, $x(\alpha, \beta)$, of the jobs of load type $(\alpha, \beta)$
that are assigned as large jobs. For each setting of the values $x(\alpha, \beta)$ for all load types, we create
an instance of the single load minimal cache demand problem in which each job corresponds to a
job in our original instance. For each job j which is one of the $x(\alpha, \beta)$ most cache demanding jobs
of load type $(\alpha, \beta)$ we create a job of load $\beta$ and cache demand 0. For each job j of load type
$(\alpha, \beta)$ which is not one of the $x(\alpha, \beta)$ most cache demanding jobs of this load type, we create a
job of load $\alpha$ and cache demand $x_j$. We solve each of the resulting instances using any algorithm
for the single load minimal cache demand problem presented in Section 4 and choose the solution
with the minimal makespan. We transform this solution back to a solution of the original instance,
by replacing each job with its corresponding job in the original instance. Note that this does not
affect the makespan or the cache usage.
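The enumeration in this reduction can be sketched as a generator of single-load instances. The function name `reduced_instances` and the tuple-based input format are assumptions made for illustration; solving each generated instance is left to whichever Section 4 algorithm is used. As a small departure from the text, each counter ranges only up to the number of jobs of its load type rather than up to n.

```python
from collections import defaultdict
from itertools import product


def reduced_instances(jobs):
    """Yield (counts, instance) pairs for the reduction.

    jobs:     list of (l, h, x) triples (small load, large load, cache demand).
    counts:   dict mapping each load type (l, h) to the number of its jobs
              treated as large jobs.
    instance: list of (load, cache_demand) pairs, one per original job and in
              the same order, for the single load minimal cache demand problem.
    """
    by_type = defaultdict(list)
    for idx, (l, h, _) in enumerate(jobs):
        by_type[(l, h)].append(idx)
    types = sorted(by_type)
    for t in types:
        # Most cache demanding jobs first: these are the ones made large.
        by_type[t].sort(key=lambda i: -jobs[i][2])

    ranges = [range(len(by_type[t]) + 1) for t in types]
    for counts in product(*ranges):
        instance = [None] * len(jobs)
        for t, k in zip(types, counts):
            for rank, i in enumerate(by_type[t]):
                l, h, x = jobs[i]
                # Large job: load h and no cache demand; small job: load l, demand x.
                instance[i] = (h, 0) if rank < k else (l, x)
        yield dict(zip(types, counts)), instance


example = [(1, 3, 4), (1, 3, 2), (2, 5, 1)]
for counts, instance in reduced_instances(example):
    print(counts, instance)
```

Because the instance keeps the original job order, mapping a solution back to the original jobs is just a matter of reusing the same indices.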

Lemma 5.1. Given a polynomial time $\alpha$-approximation algorithm for the single load minimal cache
demand problem that uses at most $\beta K$ cache, the reduction described above gives a polynomial
time $\alpha$-approximation algorithm for the problem where job loads are step functions with a constant
number of load types, that uses at most $\beta K$ cache.

Proof. Consider an instance of the joint cache partition and job assignment problem with load
functions that are step functions with a constant number of load types. Assume there is a solution
A for this instance of makespan m that uses at most K cache. Let $x(\alpha, \beta)$ be the number of jobs
of load type $(\alpha, \beta)$ that are assigned in A as large jobs. W.l.o.g. we can assume that for each
$(\alpha, \beta)$, the $x(\alpha, \beta)$ jobs that are assigned as large jobs are the $x(\alpha, \beta)$ most cache demanding jobs of
load type $(\alpha, \beta)$. The existence of A implies that when our algorithm considers the same values for
$x(\alpha, \beta)$, for each $(\alpha, \beta)$, it generates an instance of the single load minimal cache demand problem that has
a solution of makespan at most m and at most K cache. Applying the $\alpha$-approximation algorithm
for the single load minimal cache demand problem, whose existence is assumed by the lemma, on
this instance yields a solution of makespan at most $\alpha m$ that uses at most $\beta K$ cache. This solution
is transformed to a solution of our original instance without affecting the makespan or the cache
usage.

Our algorithm runs in polynomial time since the size of the enumeration is $O(n^{|L||H|})$. □

Corollary 5.2. For instances in which the load functions are step functions with a constant number
of load types there are polynomial time approximation algorithms that approximate the makespan
up to a factor of 2, $\frac{3}{2}$ and $\frac{4}{3}$ and use at most K, 2K and 3K cache, respectively.

5.1 The corresponding special case of ordered unrelated machines 

Recall that if we fix the cache partition in an instance of the joint cache partition and job assignment 
problem then we obtain an instance of the ordered unrelated machines scheduling problem. For 
the case where the load functions are step functions with a constant number of load types, the 
resulting ordered unrelated machines instance can be solved in polynomial time using the dynamic 
programming algorithm described below. The dynamic program follows a structure similar to the 
one used in [3], where polynomial time approximation schemes are obtained for several variants of 
scheduling with restricted processing sets. 

In this special case of the ordered unrelated scheduling problem job j runs in time $l_j$ on some
prefix of the machines, and in time $h_j$ on the suffix (we assume that the machines are ordered in
non-increasing order of their strength/cache allocation). For simplicity, we assume $x_j$ is given as
the index of the first machine on which job j has load $h_j$. If job j takes the same amount of time
to run regardless of cache, we assume $x_j = c + 1$ and its load on any machine is $l_j$. As before, we
assume that $L = \{l_j \mid j \in J\}$ and $H = \{h_j \mid j \in J\}$ are of constant size.

We design a polynomial time algorithm that finds a job assignment that minimizes the makespan.
The algorithm does a binary search for the optimal makespan, as in Section 4.4, using an algorithm
for the following decision problem: Is there an assignment of the jobs J to the c machines with
makespan at most M? By scaling the loads, we assume that M = 1.

For every machine m, we define $S_m = \{j \in J \mid x_j = m + 1\}$, the set of all jobs that are large
on machine m + 1 and small on any machine $i \le m$. Let $S_m(\alpha, \beta) = \{j \in S_m \mid l_j = \alpha, h_j = \beta\}$ and
$b_m(\alpha, \beta) = |S_m(\alpha, \beta)|$. It is convenient to think of $b_m$ as a vector in $\{0, \ldots, n\}^{|L||H|}$.

Let $a \in \{0, \ldots, n\}^{|L||H|}$, $\delta \in \{0, \ldots, n\}^{|H|}$ and m be any machine. Let $J(m, a)$ be a set of jobs
which contains all the jobs in $\bigcup_{i=1}^{m} S_i$ together with additional $a(\alpha, \beta)$ jobs of load type $(\alpha, \beta)$ from
$\bigcup_{i=m+1}^{c} S_i$, for each load type $(\alpha, \beta)$. Let $\pi_m(a, \delta)$ be 1 if we can schedule all the jobs in $J(m, a)$,
except for $\delta(\beta)$ jobs of each large load type $\beta$, on the first m machines, and 0 otherwise. Note that since the
additional jobs specified by a are small on all machines $1, \ldots, m$, $\pi_m(a, \delta)$ does not depend on the
additional jobs' identity. Our original decision problem has a solution if and only if $\pi_c(0, 0) = 1$.

Consider the decision problem $\pi_1(a, \delta)$. We want to decide if it is possible to schedule the jobs in
$J(1, a)$, except for $\delta(\beta)$ jobs of each large load type $\beta$, on machine 1. To decide this, our algorithm
chooses the $\delta(\beta)$ jobs of each large load type $\beta$ that have the largest small loads and removes them
from $J(1, a)$. If the sum of the small loads of the remaining jobs is at most 1, then $\pi_1(a, \delta) = 1$,
and otherwise $\pi_1(a, \delta) = 0$.

To solve $\pi_m(a, \delta)$ we enumerate, for each load type $(\alpha, \beta)$, on $\xi(\alpha, \beta)$, the number of jobs in
$J(m, a)$ of this load type that are assigned as small jobs to machine m. Note that these jobs are
either in $S_m(\alpha, \beta)$ or in the additional set of $a(\alpha, \beta)$ jobs of type $(\alpha, \beta)$. For each $\beta \in H$, we
enumerate on the number $\lambda(\beta)$ of jobs in $J(m, a)$ of large load type $\beta$ that are assigned as large
jobs to machine m. The following lemma is the basis for our dynamic programming scheme. Its
proof is straightforward.

Lemma 5.3. We can schedule all the jobs in $J(m, a)$ except for $\delta(\beta)$ jobs of large load type $\beta$ (for
each $\beta \in H$) on machines $1, \ldots, m$ with makespan at most 1 such that $\xi(\alpha, \beta)$ jobs of load type
$(\alpha, \beta)$ are assigned to machine m as small jobs and $\lambda(\beta)$ jobs of large load type $\beta$ are assigned to
machine m as large jobs if and only if the following conditions hold:

• For each $(\alpha, \beta) \in L \times H$, $\xi(\alpha, \beta) \le a(\alpha, \beta) + b_m(\alpha, \beta)$. The number of jobs of each load type
that we assign as small jobs to machine m is at most the number of jobs in $J(m, a)$ of this
load type that are small on machine m.

• $\sum_{\beta \in H} \lambda(\beta)\beta + \sum_{(\alpha, \beta) \in L \times H} \xi(\alpha, \beta)\alpha \le 1$. The total load of the jobs assigned to machine m is at
most 1.

• Let $a' = a + b_m - \xi$ and $\delta' = \delta + \lambda$; then $\pi_{m-1}(a', \delta') = 1$. The jobs in $J(m-1, a')$, except
for $\delta'(\beta)$ jobs of large load $\beta$ for each $\beta \in H$, can be scheduled on machines $1, \ldots, m-1$ with
makespan at most 1.

The algorithm for solving $\pi_m(a, \delta)$ sets $\pi_m(a, \delta) = 1$ if it finds $\lambda$ and $\xi$ such that the conditions
in Lemma 5.3 are met. If the conditions are not met for all $\lambda$ and $\xi$ then $\pi_m(a, \delta) = 0$.

Our dynamic program solves $\pi_m(a, \delta)$ in increasing order of m from 1 to c and returns the result
of $\pi_c(0, 0)$. The correctness of the dynamic program follows from Lemma 5.3 and from the fact
that for m = 1, our algorithm chooses the jobs that it does not assign to machine 1 such that the
remaining load on machine 1 is minimized. Therefore we set $\pi_1(a, \delta) = 1$ if and only if there is a
solution of makespan at most 1.
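The decision procedure can be sketched as a memoized recursion over $(m, a, \delta)$. This is a minimal illustration, not the paper's implementation: the function name `feasible` and the input format are hypothetical, every job is assumed to be small on machine 1 (so $x_j \ge 2$, which the base case above relies on), the target makespan is passed as `M` instead of rescaling loads to 1, and $\lambda(\beta)$ is enumerated only up to the number of jobs of large type $\beta$, since reserving more slots than jobs cannot help.

```python
from fractions import Fraction
from functools import lru_cache
from itertools import product


def feasible(jobs, c, M=1):
    """Decide whether the jobs fit on c ordered machines with makespan <= M.

    jobs: list of (l, h, x); job has load l on machines 1..x-1 and h on x..c.
    Assumption of this sketch: 2 <= x <= c + 1 for every job.
    """
    types = sorted({(l, h) for l, h, _ in jobs})
    larges = sorted({h for _, h in types})
    count_h = {beta: sum(1 for _, h, _ in jobs if h == beta) for beta in larges}
    # b[m][t] = b_m(t): number of jobs of load type t with x = m + 1.
    b = [{t: 0 for t in types} for _ in range(c + 1)]
    for l, h, x in jobs:
        assert 2 <= x <= c + 1
        b[x - 1][(l, h)] += 1

    @lru_cache(maxsize=None)
    def pi(m, a, delta):
        # Jobs of each type that are small on machine m: a + b_m.
        avail = {t: a[k] + b[m][t] for k, t in enumerate(types)}
        if m == 1:
            load = 0
            for j, beta in enumerate(larges):
                smalls = sorted(
                    (l for (l, h) in types if h == beta for _ in range(avail[(l, h)])),
                    reverse=True,
                )
                # Drop the delta(beta) jobs with the largest small loads.
                load += sum(smalls[delta[j]:])
            return load <= M
        for xi in product(*(range(avail[t] + 1) for t in types)):
            small_load = sum(k * t[0] for k, t in zip(xi, types))
            if small_load > M:
                continue
            a2 = tuple(avail[t] - k for k, t in zip(xi, types))  # a' = a + b_m - xi
            lam_ranges = [range(count_h[beta] - delta[j] + 1)
                          for j, beta in enumerate(larges)]
            for lam in product(*lam_ranges):
                if small_load + sum(k * beta for k, beta in zip(lam, larges)) > M:
                    continue
                d2 = tuple(d + k for d, k in zip(delta, lam))  # delta' = delta + lam
                if pi(m - 1, a2, d2):
                    return True
        return False

    return pi(c, tuple(0 for _ in types), tuple(0 for _ in larges))


half = Fraction(1, 2)
jobs = [(half, 1, 3), (half, 1, 2), (half, 1, 2)]
print(feasible(jobs, c=2), feasible(jobs, c=2, M=Fraction(3, 4)))
```

Only the counts per load type are tracked, which mirrors the observation that $\pi_m(a, \delta)$ does not depend on the identity of the additional jobs.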

By adding backtracking links, our algorithm can also construct a schedule with makespan at
most 1. We maintain links between each $\pi_m(a, \delta)$ that is 1 to a corresponding $\pi_{m-1}(a', \delta')$ that is
also 1, according to the last condition in Lemma 5.3. Tracing back the links from $\pi_c(0, 0)$ gives us an
assignment with makespan at most 1 as follows. Consider a link between $\pi_m(a, \delta)$ and $\pi_{m-1}(a', \delta')$.
This defines $\lambda = \delta' - \delta$ and $\xi = a + b_m - a'$. For each $(\alpha, \beta)$ we assign to machine m, $\xi(\alpha, \beta)$
arbitrary jobs of load type $(\alpha, \beta)$ from $\bigcup_{i=m}^{c} S_i$ that we have not assigned already, and we reserve
$\lambda(\beta)$ slots of load $\beta$ on machine m to be populated with jobs later. Our algorithm guarantees that
the load on machine m is at most 1. When we reach $\pi_1(a, \delta)$, for some a and $\delta$, in the backtracking
phase, we have $\delta(\beta)$ slots of size $\beta$ allocated on machines $2, \ldots, c$. The $\delta(\beta)$ jobs of large load $\beta$
with the largest small loads in $J(1, a)$ are assigned to these slots. Note that these jobs may be large
on their machine and have a load of $\beta$ or they may be small and have a load smaller than $\beta$. In
any case, the resulting assignment assigns all the jobs in J and has a makespan of at most 1.

The number of problems $\pi_m(a, \delta)$ that our dynamic program solves is $O(cn^{|L||H|+|H|}) = O(cn^{O(1)})$.
To solve each problem, we check the conditions in Lemma 5.3 for $O(n^{|L||H|+|H|})$ possible $\lambda$'s and $\xi$'s.
This takes $O(1)$ per $\lambda$ and $\xi$ since we already computed $\pi_{m-1}(a', \delta')$ for every a' and $\delta'$. Thus the
total complexity of this algorithm is polynomial. This concludes the proof of the following theorem.

Theorem 5.4. Our dynamic programming algorithm is a polynomial-time exact optimization
algorithm for the special case of the ordered unrelated machines scheduling problem, where each job
j has load $l_j$ on some prefix of the machines, and load $h_j \ge l_j$ on the corresponding suffix.

6 Joint dynamic cache partition and job scheduling 

We consider a generalization of the joint cache partition and job assignment problem that allows 
for dynamic cache partitions and dynamic job assignments. We define the generalized problem as 
follows. As before, J denotes the set of jobs, there are c cores and a total cache of size K. Each 
job $j \in J$ is described by a non-increasing function $T_j(x)$.

A dynamic cache partition $p = p(t, i)$ indicates the amount of cache allocated to core i at time
unit t. For each time unit t, $\sum_{i=1}^{c} p(t, i) \le K$. A dynamic assignment $S = S(t, i)$ indicates for each
core i and time unit t, the index of the job that runs on core i at time t. If no job runs on core
i at time t then $S(t, i) = -1$. If $S(t, i) = j \ne -1$ then for any other core $i_2 \ne i$, $S(t, i_2) \ne j$.
Each job has to perform 1 work unit. If job j runs for $\alpha$ time units on a core with x cache, then
it completes $\alpha / T_j(x)$ work. A partition and schedule p, S are valid if all jobs complete their work.

Formally, p, S are valid if for each job j, $\sum_{(t, i) \in S^{-1}(j)} \frac{1}{T_j(p(t, i))} = 1$. The load of core i is defined as
the maximum t such that $S(t, i) \ne -1$. The makespan of p, S is defined as the maximum load on
any core. The goal is to find a valid dynamic cache partition and dynamic job assignment with a
minimal makespan.
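The validity condition and the makespan of a dynamic solution can be checked mechanically. The sketch below assumes a concrete encoding that is not in the paper: `p` and `S` are lists indexed by time unit and then by core (time units are counted from 1 when reporting the makespan), each `T[j]` is a Python callable returning a finite positive running time, and the function names are hypothetical.

```python
from fractions import Fraction


def is_valid(T, p, S, K):
    """Check the validity conditions of a dynamic partition p and assignment S.

    T[j](x) : running time of job j with x cache (finite, positive).
    p[t][i] : cache of core i in time unit t.
    S[t][i] : job running on core i in time unit t, or -1 if the core is idle.
    """
    work = [Fraction(0)] * len(T)
    for t, row in enumerate(S):
        if sum(p[t]) > K:
            return False  # more than K cache handed out in this time unit
        running = [j for j in row if j != -1]
        if len(running) != len(set(running)):
            return False  # a job runs on two cores simultaneously
        for i, j in enumerate(row):
            if j != -1:
                # One time unit with p[t][i] cache completes 1 / T_j(p(t, i)) work.
                work[j] += Fraction(1, T[j](p[t][i]))
    return all(w == 1 for w in work)


def dynamic_makespan(S):
    """Largest time unit (counted from 1) in which some core is busy."""
    last = 0
    for t, row in enumerate(S, start=1):
        if any(j != -1 for j in row):
            last = t
    return last


T = [lambda x: 1 if x >= 2 else 2, lambda x: 2]
p = [[2, 0], [0, 2]]
S = [[0, 1], [-1, 1]]
print(is_valid(T, p, S, K=2), dynamic_makespan(S))
```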

It is easy to verify that dynamic cache partition and dynamic job assignment, as defined above, 
generalize the static partition and static job assignment. The partition is static if for every fixed 
core i, p{t,i) is constant with respect to t. The schedule is a static assignment if for every job j, 
there are times $t_1 < t_2$ and a core i such that $S^{-1}(j) = \{(t, i) \mid t_1 < t < t_2\}$.

We consider four variants of the joint cache partition and job assignment problem: the static
partition and static assignment variant studied so far, the variant in which the cache partition is
dynamic and the job assignment is static, the variant in which the job assignment is dynamic and
the cache partition is static, and the variant in which both are dynamic.

Note that in the variant where the cache partition is dynamic but the job assignment is static 
we still have to specify for each core, in which time units it runs each job that is assigned to this 
core. That is, we have to specify a function S{t,i) for each core i. This is due to the fact that 
different schedules of the same set of jobs assigned to a particular core, when the cache partition 
is dynamic, may have different loads, since jobs may run with different cache allocations. When 
the cache partition is also static, the different schedules of the same set of jobs on a particular core 
have the same load, and it suffices to specify which jobs are assigned to which core. 

^To simplify the presentation we assume that time is discrete. 
We study the makespan improvement that can be gained by allowing a dynamic solution. We
show that allowing a dynamic partition and a dynamic assignment can improve the makespan by
a factor of at most c, the number of cores. We also show an instance where by using a dynamic
partition and a static assignment we achieve an improvement factor arbitrarily close to c. We show
that allowing a dynamic assignment of the jobs, while keeping the cache partition static, improves
the makespan by at most a factor of 2, and that there is an instance where an improvement of $2 - \frac{2}{c}$
is achieved, for $c \ge 2$.

Given an instance of the joint cache partition and job assignment problem, we denote by $O_{SS}$
the optimal static cache partition and static job assignment, by $O_{DS}$ the optimal dynamic cache
partition and static job assignment, by $O_{SD}$ the optimal static cache partition and dynamic job
schedule and by $O_{DD}$ the optimal dynamic cache partition and dynamic job schedule. For any
solution A we denote its makespan by $M(A)$.

Lemma 6.1. For any instance of the joint cache partition and job assignment problem, $M(O_{SS}) \le
c\,M(O_{DD})$.

Proof. Let A be the trivial static partition and schedule that assigns all jobs to the first core and
allocates all the cache to this core. Consider any job j that takes a total of $\alpha$ time to run
in the solution $O_{DD}$. Whenever a fraction of job j runs on some core with some cache partition,
it has at most K cache available to it. Therefore, in solution A, when we run job j continuously
on one core with K cache, it takes at most $\alpha$ time. Since the total running time of all the jobs in
solution $O_{DD}$ is at most $c\,M(O_{DD})$, we get $M(O_{SS}) \le M(A) \le c\,M(O_{DD})$. □

Corollary 6.2. For any instance of the joint cache partition and job assignment problem, $M(O_{SS}) \le
c\,M(O_{DS})$.

Proof. Clearly, $M(O_{DS}) \ge M(O_{DD})$ for any instance. Combine this with Lemma 6.1 and we get
that $M(O_{SS}) \le c\,M(O_{DS})$. □

Lemma 6.3. For any $\varepsilon > 0$ there is an instance of the joint cache partition and job assignment
problem, such that $M(O_{SS}) > (c - \varepsilon)M(O_{DS})$.

Proof. Let b be an arbitrary constant. Consider the following instance with two types of jobs.
There are c jobs of type 1, such that for each such job j, $T_j(x) = \infty$ for $x < K$ and $T_j(K) = 1$.
There are bc jobs of type 2, such that for each such job j, $T_j(x) = bc$ if $x < \frac{K}{c}$ and $T_j(x) = 1$ if
$x \ge \frac{K}{c}$.

Consider the following solution. The static job assignment runs b jobs of type 2 on each core.
After b time units, it runs the c jobs of type 1 on core 1. The dynamic cache partition starts with
each core getting $\frac{K}{c}$ cache. The cache partition changes after b time units and core 1 gets all the
cache. This solution has a makespan of $b + c$ and therefore $M(O_{DS}) \le b + c$.

There is an optimal static cache partition and static job assignment that allocates to each core
0, $\frac{K}{c}$ or K cache, because otherwise we can reduce the amount of cache allocated to a core without
changing the makespan of the solution. This implies that there are only two static cache partitions
that may be used by this optimal static solution: the partition in which $p(i) = \frac{K}{c}$ for each
core i, and the partition that gives all the cache to a single core. It is easy to see that if we use the
cache partition where $p(i) = \frac{K}{c}$ we get a solution with an infinite makespan because of the jobs of
type 1. Therefore this optimal static solution uses a cache partition that gives all the cache to a
single core. Given this partition, the optimal job assignment is to run all the c jobs of type 1 on
the core with all the cache, and assign to that core an additional $bc - (c - 1)$ jobs of type 2. So the load
on that core is $bc + 1$. Each of the $c - 1$ cores with no cache is assigned exactly one job of type 2,
and each such core has a load of bc. Therefore the ratio $\frac{M(O_{SS})}{M(O_{DS})} \ge \frac{bc + 1}{b + c}$. The lower bound on this
ratio approaches c as b approaches infinity. Since b is an arbitrarily chosen constant, we can choose
it large enough such that we get a lower bound that is greater than $c - \varepsilon$, for any $\varepsilon > 0$. □
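As a worked check of the limit used in this proof (a sketch that relies only on the two makespan values $bc + 1$ and $b + c$ derived above), the gap between c and the lower bound can be written explicitly:

\[
c - \frac{bc + 1}{b + c} = \frac{c(b + c) - (bc + 1)}{b + c} = \frac{c^2 - 1}{b + c},
\]

so that

\[
\frac{bc + 1}{b + c} > c - \varepsilon
\iff \frac{c^2 - 1}{b + c} < \varepsilon
\iff b > \frac{c^2 - 1}{\varepsilon} - c .
\]

For example, with $c = 4$ and $\varepsilon = 0.1$, any $b > 146$ suffices.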

Corollary 6.4. For any $\varepsilon > 0$ there is an instance of the joint cache partition and job assignment
problem, such that $M(O_{SS}) > (c - \varepsilon)M(O_{DD})$.

Proof. Consider the same instance as in the proof of Lemma 6.3. For that instance, $M(O_{SS}) >
(c - \varepsilon)M(O_{DS})$. It follows that $M(O_{SS}) > (c - \varepsilon)M(O_{DD})$ for the instance in Lemma 6.3, since
$M(O_{DS}) \ge M(O_{DD})$. □

Lemma 6.5. For any instance of the joint cache partition and job assignment problem, $M(O_{SS}) \le
2M(O_{SD})$.

Proof. Consider any instance of the joint cache partition and job assignment problem and let
$O_{SD} = (p, S)$. Let $x_{ij}$ be the fraction of job j's work unit that is carried out by core i. Formally,
$x_{ij} = \sum_{t : S(t, i) = j} \frac{1}{T_j(p(i))}$. Consider the instance of scheduling on unrelated machines where job
j runs on core i in time $T_j(p(i))$. Since for every job j, $\sum_{i=1}^{c} x_{ij} = 1$, x is a fractional
assignment for that instance of the unrelated machines scheduling problem. The makespan of this
fractional solution is $M(O_{SD})$. Let y be the optimal fractional assignment of the defined instance
of unrelated machines. We know that if we apply Lenstra's rounding theorem [7] to y, we get an
integral assignment for the unrelated machines scheduling instance, denoted by z, such that the
makespan of z is at most twice the makespan of y and therefore at most twice the makespan of x.
Assignment z is a static job assignment and therefore (p, z) is a solution to the joint static cache
partition and static job assignment problem of our original instance, with makespan at most twice
$M(O_{SD})$. It follows that $M(O_{SS}) \le 2M(O_{SD})$. □

Lemma 6.6. For $c \ge 2$, there is an instance of the joint partition and scheduling problem such
that $\frac{M(O_{SS})}{M(O_{SD})} = 2 - \frac{2}{c}$.

Proof. Consider the following instance. There are c jobs, where each takes $1 - \frac{1}{c}$ time regardless
of the cache allocation, and one job that takes 1 time unit, regardless of cache. The optimal static
schedule for this instance assigns two jobs of size $1 - \frac{1}{c}$ to the first core, assigns one job of size
$1 - \frac{1}{c}$ to each of the cores $2, \ldots, c - 1$, and assigns the unit sized job to the last core. This yields
a makespan of $2 - \frac{2}{c}$. The optimal dynamic assignment assigns one job of size $1 - \frac{1}{c}$ fully to each
core, and then splits the unit job equally among the cores, to yield a makespan of 1. Notice that
this can be scheduled in a way that the unit job never runs simultaneously on more than one
core. This is achieved by running the ith fraction of size $\frac{1}{c}$ of the unit job on core i at time $\frac{i-1}{c}$.
The other jobs, that are fully assigned to a single core, are paused and resumed later, if necessary,
to accommodate the fractions of the unit sized job. Therefore in this instance the ratio $\frac{M(O_{SS})}{M(O_{SD})}$ is
exactly $2 - \frac{2}{c}$. □
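The static side of this instance can be verified by brute force for small c. The sketch below is only a sanity check under stated assumptions: the function names are hypothetical, the search enumerates all $c^{c+1}$ static assignments (so it is practical only for very small c), and the dynamic makespan of 1 is taken from the construction in the proof, with the code confirming only that the total work divided by c equals 1.

```python
from fractions import Fraction
from itertools import product


def lemma_instance(c):
    """c jobs of size 1 - 1/c plus one unit job, all independent of cache."""
    return [1 - Fraction(1, c)] * c + [Fraction(1)]


def static_optimum(loads, c):
    """Minimum makespan over all static assignments of the jobs to c cores."""
    best = None
    for assign in product(range(c), repeat=len(loads)):
        core = [Fraction(0)] * c
        for load, i in zip(loads, assign):
            core[i] += load
        m = max(core)
        if best is None or m < best:
            best = m
    return best


for c in (2, 3, 4):
    loads = lemma_instance(c)
    # Total work is exactly c, so no schedule on c cores can beat makespan 1.
    assert sum(loads) / c == 1
    print(c, static_optimum(loads, c), 2 - Fraction(2, c))
```

Since the jobs do not depend on the cache, the cache partition plays no role in this check.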